2022-01-29

  • cs.LG updates on arXiv.org

    FinGAN: Generative Adversarial Network for Analytical Customer Relationship Management in Banking and Insurance. (arXiv:2201.11486v1 [cs.CE])
    (2 min) Churn prediction in credit cards, fraud detection in insurance, and loan default prediction are important analytical customer relationship management (ACRM) problems. Since frauds, churns and defaults happen less frequently, the datasets for these problems turn out to be naturally highly unbalanced. Consequently, all supervised machine learning classifiers tend to yield substantial false-positive rates when trained on such unbalanced datasets. We propose two ways of data balancing. In the first, we propose an oversampling method to generate synthetic samples of minority class using Generative Adversarial Network (GAN). We employ Vanilla GAN [1], Wasserstein GAN [2] and CTGAN [3] separately to oversample the minority class samples. In order to assess the efficacy of our proposed approach, we use a host of machine learning classifiers, including Random Forest, Decision Tree, support vector machine (SVM), and Logistic Regression on the data balanced by GANs. In the second method, we introduce a hybrid method to handle data imbalance. In this second way, we utilize the power of undersampling and over-sampling together by augmenting the synthetic minority class data oversampled by GAN with the undersampled majority class data obtained by one-class support vigor machine (OCSVM) [4]. We combine both over-sampled data generated by GAN and the data under-sampled by OCSVM [4] and pass the resultant data to classifiers. When we compared our results to those of Farquad et al. [5], Sundarkumar, Ravi, and Siddeshwar [6], our proposed methods outperform the previous results in terms of the area under the ROC curve (AUC) on all datasets.
    Stochastic First-order Methods for Convex and Nonconvex Functional Constrained Optimization. (arXiv:1908.02734v4 [math.OC] UPDATED)
    (2 min) Functional constrained optimization is becoming more and more important in machine learning and operations research. Such problems have potential applications in risk-averse machine learning, semisupervised learning, and robust optimization among others. In this paper, we first present a novel Constraint Extrapolation (ConEx) method for solving convex functional constrained problems, which utilizes linear approximations of the constraint functions to define the extrapolation (or acceleration) step. We show that this method is a unified algorithm that achieves the best-known rate of convergence for solving different functional constrained convex composite problems, including convex or strongly convex, and smooth or nonsmooth problems with a stochastic objective and/or stochastic constraints. Many of these rates of convergence were in fact obtained for the first time in the literature. In addition, ConEx is a single-loop algorithm that does not involve any penalty subproblems. Contrary to existing primal-dual methods, it does not require the projection of Lagrangian multipliers into a (possibly unknown) bounded set. Second, for nonconvex functional constrained problems, we introduce a new proximal point method that transforms the initial nonconvex problem into a sequence of convex problems by adding quadratic terms to both the objective and constraints. Under a certain MFCQ-type assumption, we establish the convergence and rate of convergence of this method to KKT points when the convex subproblems are solved exactly or inexactly. For large-scale and stochastic problems, we present a more practical proximal point method in which the approximate solutions of the subproblems are computed by the aforementioned ConEx method. To the best of our knowledge, most of these convergence and complexity results of the proximal point method for nonconvex problems also seem to be new in the literature.
    Transfer Portal: Accurately Forecasting the Impact of a Player Transfer in Soccer. (arXiv:2201.11533v1 [cs.LG])
    (2 min) One of the most important and challenging problems in football is predicting future player performance when transferred to another club within and between different leagues. In addition to being the most valuable prediction a team makes, it is also the most complex analytics task to perform as it needs to take into consideration: a) differences in playing style between the player's current team and target team, b) differences in style and ability of other players on each team, c) differences in league quality and style, and d) the role the player is desired to play. In this paper, we present a method which addresses these issues and enables us to make accurate predictions of future performance. Our Transfer Portal model utilizes a personalized neural network accounting for both stylistic and ability level input representations for players, teams, and leagues to simulate future player performance at any chosen club. Furthermore, we use a Bayesian updating framework to dynamically modify player and team representations over time which enables us to generate predictions for rising stars with small amounts of data.
    GraphTune: A Learning-based Graph Generative Model with Tunable Structural Features. (arXiv:2201.11494v1 [cs.LG])
    (2 min) Generative models for graphs have been actively studied for decades, and they have a wide range of applications. Recently, learning-based graph generation that reproduces real-world graphs has gradually attracted the attention of many researchers. Several generative models that utilize modern machine learning technologies have been proposed, though a conditional generation of general graphs is less explored in the field. In this paper, we propose a generative model that allows us to tune a value of a global-level structural feature as a condition. Our model called GraphTune enables to tune a value of any structural feature of generated graphs using Long Short Term Memory (LSTM) and Conditional Variational AutoEncoder (CVAE). We performed comparative evaluations of GraphTune and conventional models with a real graph dataset. The evaluations show that GraphTune enables to clearly tune a value of a global-level structural feature compared to the conventional models.
    Neural Implicit Surfaces in Higher Dimension. (arXiv:2201.09636v2 [cs.LG] UPDATED)
    (2 min) This work investigates the use of neural networks admitting high-order derivatives for modeling dynamic variations of smooth implicit surfaces. For this purpose, it extends the representation of differentiable neural implicit surfaces to higher dimensions, which opens up mechanisms that allow to exploit geometric transformations in many settings, from animation and surface evolution to shape morphing and design galleries. The problem is modeled by a $k$-parameter family of surfaces $S_c$, specified as a neural network function $f : \mathbb{R}^3 \times \mathbb{R}^k \rightarrow \mathbb{R}$, where $S_c$ is the zero-level set of the implicit function $f(\cdot, c) : \mathbb{R}^3 \rightarrow \mathbb{R} $, with $c \in \mathbb{R}^k$, with variations induced by the control variable $c$. In that context, restricted to each coordinate of $\mathbb{R}^k$, the underlying representation is a neural homotopy which is the solution of a general partial differential equation.
    Generative Adversarial Exploration for Reinforcement Learning. (arXiv:2201.11685v1 [cs.LG])
    (2 min) Exploration is crucial for training the optimal reinforcement learning (RL) policy, where the key is to discriminate whether a state visiting is novel. Most previous work focuses on designing heuristic rules or distance metrics to check whether a state is novel without considering such a discrimination process that can be learned. In this paper, we propose a novel method called generative adversarial exploration (GAEX) to encourage exploration in RL via introducing an intrinsic reward output from a generative adversarial network, where the generator provides fake samples of states that help discriminator identify those less frequently visited states. Thus the agent is encouraged to visit those states which the discriminator is less confident to judge as visited. GAEX is easy to implement and of high training efficiency. In our experiments, we apply GAEX into DQN and the DQN-GAEX algorithm achieves convincing performance on challenging exploration problems, including the game Venture, Montezuma's Revenge and Super Mario Bros, without further fine-tuning on complicate learning algorithms. To our knowledge, this is the first work to employ GAN in RL exploration problems.
    Logic of Machine Learning. (arXiv:2006.09500v4 [cs.LG] UPDATED)
    (2 min) The main question is: why and how can we ever predict based on a finite sample? The question is not answered by statistical learning theory. Here, I suggest that prediction requires belief in "predictability" of the underlying dependence, and learning involves search for a hypothesis where these beliefs are violated the least given the observations. The measure of these violations ("errors") for given data, hypothesis and particular type of predictability beliefs is formalized as concept of incongruity in modal Logic of Observations and Hypotheses (LOH). I show on examples of many popular textbook learners (from hierarchical clustering to k-NN and SVM) that each of them minimizes its own version of incongruity. In addition, the concept of incongruity is shown to be flexible enough for formalization of some important data analysis problems, not considered as part of ML.
    Early Detection of Network Attacks Using Deep Learning. (arXiv:2201.11628v1 [cs.CR])
    (2 min) The Internet has become a prime subject to security attacks and intrusions by attackers. These attacks can lead to system malfunction, network breakdown, data corruption or theft. A network intrusion detection system (IDS) is a tool used for identifying unauthorized and malicious behavior by observing the network traffic. State-of-the-art intrusion detection systems are designed to detect an attack by inspecting the complete information about the attack. This means that an IDS would only be able to detect an attack after it has been executed on the system under attack and might have caused damage to the system. In this paper, we propose an end-to-end early intrusion detection system to prevent network attacks before they could cause any more damage to the system under attack while preventing unforeseen downtime and interruption. We employ a deep neural network-based classifier for attack identification. The network is trained in a supervised manner to extract relevant features from raw network traffic data instead of relying on a manual feature selection process used in most related approaches. Further, we introduce a new metric, called earliness, to evaluate how early our proposed approach detects attacks. We have empirically evaluated our approach on the CICIDS2017 dataset. The results show that our approach performed well and attained an overall 0.803 balanced accuracy.
    DropNAS: Grouped Operation Dropout for Differentiable Architecture Search. (arXiv:2201.11679v1 [cs.LG])
    (2 min) Neural architecture search (NAS) has shown encouraging results in automating the architecture design. Recently, DARTS relaxes the search process with a differentiable formulation that leverages weight-sharing and SGD where all candidate operations are trained simultaneously. Our empirical results show that such procedure results in the co-adaption problem and Matthew Effect: operations with fewer parameters would be trained maturely earlier. This causes two problems: firstly, the operations with more parameters may never have the chance to express the desired function since those with less have already done the job; secondly, the system will punish those underperforming operations by lowering their architecture parameter, and they will get smaller loss gradients, which causes the Matthew Effect. In this paper, we systematically study these problems and propose a novel grouped operation dropout algorithm named DropNAS to fix the problems with DARTS. Extensive experiments demonstrate that DropNAS solves the above issues and achieves promising performance. Specifically, DropNAS achieves 2.26% test error on CIFAR-10, 16.39% on CIFAR-100 and 23.4% on ImageNet (with the same training hyperparameters as DARTS for a fair comparison). It is also observed that DropNAS is robust across variants of the DARTS search space. Code is available at https://github.com/wiljohnhong/DropNAS.
    Setting AI in context: A case study on defining the context and operational design domain for automated driving. (arXiv:2201.11451v1 [cs.SE])
    (2 min) [Context and motivation] For automated driving systems, the operational context needs to be known in order to state guarantees on performance and safety. The operational design domain (ODD) is an abstraction of the operational context, and its definition is an integral part of the system development process. [Question / problem] There are still major uncertainties in how to clearly define and document the operational context in a diverse and distributed development environment such as the automotive industry. This case study investigates the challenges with context definitions for the development of perception functions that use machine learning for automated driving. [Principal ideas/results] Based on qualitative analysis of data from semi-structured interviews, the case study shows that there is a lack of standardisation for context definitions across the industry, ambiguities in the processes that lead to deriving the ODD, missing documentation of assumptions about the operational context, and a lack of involvement of function developers in the context definition. [Contribution] The results outline challenges experienced by an automotive supplier company when defining the operational context for systems using machine learning. Furthermore, the study collected ideas for potential solutions from the perspective of practitioners.
    Restarted Nonconvex Accelerated Gradient Descent: No More Polylogarithmic Factor in the $O(\epsilon^{-7/4})$ Complexity. (arXiv:2201.11411v1 [math.OC])
    (2 min) This paper studies the accelerated gradient descent for general nonconvex problems under the gradient Lipschitz and Hessian Lipschitz assumptions. We establish that a simple restarted accelerated gradient descent (AGD) finds an $\epsilon$-approximate first-order stationary point in $O(\epsilon^{-7/4})$ gradient computations with simple proofs. Our complexity does not hide any polylogarithmic factors, and thus it improves over the state-of-the-art one by the $O(\log\frac{1}{\epsilon})$ factor. Our simple algorithm only consists of Nesterov's classical AGD and a restart mechanism, and it does not need the negative curvature exploitation or the optimization of regularized surrogate functions. Technically, our simple proof does not invoke the analysis for the strongly convex AGD, which is crucial to remove the $O(\log\frac{1}{\epsilon})$ factor.
    Multi-Frame Quality Enhancement On Compressed Video Using Quantised Data of Deep Belief Networks. (arXiv:2201.11389v1 [eess.IV])
    (2 min) In the age of streaming and surveillance compressed video enhancement has become a problem in need of constant improvement. Here, we investigate a way of improving the Multi-Frame Quality Enhancement approach. This approach consists of making use of the frames that have the peak quality in the region to improve those that have a lower quality in that region. This approach consists of obtaining quantized data from the videos using a deep belief network. The quantized data is then fed into the MF-CNN architecture to improve the compressed video. We further investigate the impact of using a Bi-LSTM for detecting the peak quality frames. Our approach obtains better results than the first approach of the MFQE which uses an SVM for PQF detection. On the other hand, our MFQE approach does not outperform the latest version of the MQFE approach that uses a Bi-LSTM for PQF detection.
    Few-shot Transfer Learning for Holographic Image Reconstruction using a Recurrent Neural Network. (arXiv:2201.11333v1 [eess.IV])
    (2 min) Deep learning-based methods in computational microscopy have been shown to be powerful but in general face some challenges due to limited generalization to new types of samples and requirements for large and diverse training data. Here, we demonstrate a few-shot transfer learning method that helps a holographic image reconstruction deep neural network rapidly generalize to new types of samples using small datasets. We pre-trained a convolutional recurrent neural network on a large dataset with diverse types of samples, which serves as the backbone model. By fixing the recurrent blocks and transferring the rest of the convolutional blocks of the pre-trained model, we reduced the number of trainable parameters by ~90% compared with standard transfer learning, while achieving equivalent generalization. We validated the effectiveness of this approach by successfully generalizing to new types of samples using small holographic datasets for training, and achieved (i) ~2.5-fold convergence speed acceleration, (ii) ~20% computation time reduction per epoch, and (iii) improved reconstruction performance over baseline network models trained from scratch. This few-shot transfer learning approach can potentially be applied in other microscopic imaging methods, helping to generalize to new types of samples without the need for extensive training time and data.
    LAGOON: An Analysis Tool for Open Source Communities. (arXiv:2201.11657v1 [cs.SI])
    (2 min) This paper presents LAGOON -- an open source platform for understanding the complex ecosystems of Open Source Software (OSS) communities. The platform currently utilizes spatiotemporal graphs to store and investigate the artifacts produced by these communities, and help analysts identify bad actors who might compromise an OSS project's security. LAGOON provides ingest of artifacts from several common sources, including source code repositories, issue trackers, mailing lists and scraping content from project websites. Ingestion utilizes a modular architecture, which supports incremental updates from data sources and provides a generic identity fusion process that can recognize the same community members across disparate accounts. A user interface is provided for visualization and exploration of an OSS project's complete sociotechnical graph. Scripts are provided for applying machine learning to identify patterns within the data. While current focus is on the identification of bad actors in the Python community, the platform's reusability makes it easily extensible with new data and analyses, paving the way for LAGOON to become a comprehensive means of assessing various OSS-based projects and their communities.
    Neural basis expansion analysis with exogenous variables: Forecasting electricity prices with NBEATSx. (arXiv:2104.05522v5 [cs.LG] UPDATED)
    (2 min) We extend the neural basis expansion analysis (NBEATS) to incorporate exogenous factors. The resulting method, called NBEATSx, improves on a well performing deep learning model, extending its capabilities by including exogenous variables and allowing it to integrate multiple sources of useful information. To showcase the utility of the NBEATSx model, we conduct a comprehensive study of its application to electricity price forecasting (EPF) tasks across a broad range of years and markets. We observe state-of-the-art performance, significantly improving the forecast accuracy by nearly 20% over the original NBEATS model, and by up to 5% over other well established statistical and machine learning methods specialized for these tasks. Additionally, the proposed neural network has an interpretable configuration that can structurally decompose time series, visualizing the relative impact of trend and seasonal components and revealing the modeled processes' interactions with exogenous factors. To assist related work we made the code available in https://github.com/cchallu/nbeatsx.
    Image Colorization: A Survey and Dataset. (arXiv:2008.10774v3 [cs.CV] UPDATED)
    (2 min) Image colorization is the process of estimating RGB colors for grayscale images or video frames to improve their aesthetic and perceptual quality. Deep learning techniques for image colorization have progressed notably over the last decade, calling the need for a systematic survey and benchmarking of these techniques. This article presents a comprehensive survey of recent state-of-the-art deep learning-based image colorization techniques, describing their fundamental block architectures, inputs, optimizers, loss functions, training protocols, and training data \textit{etc.} It categorizes the existing colorization techniques into seven classes and discusses important factors governing their performance, such as benchmark datasets and evaluation metrics. We highlight the limitations of existing datasets and introduce a new dataset specific to colorization. Using the existing datasets and our new one, we perform an extensive experimental evaluation of existing image colorization methods. Finally, we discuss the limitations of existing methods and recommend possible solutions as well as future research directions for this rapidly evolving topic of deep image colorization. Dataset and codes for evaluation are publicly available at https://github.com/saeed-anwar/ColorSurvey
    A Machine Learning-based Characterization Framework for Parametric Representation of Nonlinear Sloshing. (arXiv:2201.11663v1 [cs.LG])
    (2 min) The growing interest in creating a parametric representation of liquid sloshing inside a container stems from its practical applications in modern engineering systems. The resonant excitation, on the other hand, can cause unstable and nonlinear water waves, resulting in chaotic motions and non-Gaussian signals. This paper presents a novel machine learning-based framework for nonlinear liquid sloshing representation learning. The proposed method is a parametric modeling technique that is based on sequential learning and sparse regularization. The dynamics are categorized into two parts: linear evolution and nonlinear forcing. The former advances the dynamical system in time on an embedded manifold, while the latter causes divergent behaviors in temporal evolution, such as bursting and switching. The proposed framework's merit is demonstrated using an experimental dataset of liquid sloshing in a tank under horizontal excitation with a wide frequency range and various vertical slat screen settings.
    Spectral Analysis and Fixed Point Stability of Deep Neural Dynamics. (arXiv:2011.13492v2 [cs.LG] UPDATED)
    (2 min) In this work, we analyze the eigenvalue spectra and stability of discrete-time dynamical systems parametrized by deep neural networks. In particular, we leverage a representation of deep neural networks as pointwise affine maps, thus exposing their local linear operators and making them accessible to classical system analytic methods.The view of neural networks as affine parameter varying maps allows us to "crack open the black box" of neural network dynamical behavior by visualizing stationary points, state-space partitioning, and eigenvalue spectra. We provide sufficient conditions for the fixed-point stability of discrete-time deep neural dynamical systems. Empirically, we analyze the variance in dynamical behavior and eigenvalue spectra of local linear operators of neural dynamics with varying weight factorizations, activation functions, bias terms, and depths.
    Vision Checklist: Towards Testable Error Analysis of Image Models to Help System Designers Interrogate Model Capabilities. (arXiv:2201.11674v1 [cs.CV])
    (2 min) Using large pre-trained models for image recognition tasks is becoming increasingly common owing to the well acknowledged success of recent models like vision transformers and other CNN-based models like VGG and Resnet. The high accuracy of these models on benchmark tasks has translated into their practical use across many domains including safety-critical applications like autonomous driving and medical diagnostics. Despite their widespread use, image models have been shown to be fragile to changes in the operating environment, bringing their robustness into question. There is an urgent need for methods that systematically characterise and quantify the capabilities of these models to help designers understand and provide guarantees about their safety and robustness. In this paper, we propose Vision Checklist, a framework aimed at interrogating the capabilities of a model in order to produce a report that can be used by a system designer for robustness evaluations. This framework proposes a set of perturbation operations that can be applied on the underlying data to generate test samples of different types. The perturbations reflect potential changes in operating environments, and interrogate various properties ranging from the strictly quantitative to more qualitative. Our framework is evaluated on multiple datasets like Tinyimagenet, CIFAR10, CIFAR100 and Camelyon17 and for models like ViT and Resnet. Our Vision Checklist proposes a specific set of evaluations that can be integrated into the previously proposed concept of a model card. Robustness evaluations like our checklist will be crucial in future safety evaluations of visual perception modules, and be useful for a wide range of stakeholders including designers, deployers, and regulators involved in the certification of these systems. Source code of Vision Checklist would be open for public use.
    Emojis predict dropouts of remote workers: An empirical study of emoji usage on GitHub. (arXiv:2102.05737v2 [cs.LG] UPDATED)
    (2 min) Emotions at work have long been identified as critical signals of work motivations, status, and attitudes, and as predictors of various work-related outcomes. When more and more employees work remotely, these emotional signals of workers become harder to observe through daily, face-to-face communications. The use of online platforms to communicate and collaborate at work provides an alternative channel to monitor the emotions of workers. This paper studies how emojis, as non-verbal cues in online communications, can be used for such purposes and how the emotional signals in emoji usage can be used to predict future behavior of workers. In particular, we present how the developers on GitHub use emojis in their work-related activities. We show that developers have diverse patterns of emoji usage, which can be related to their working status including activity levels, types of work, types of communications, time management, and other behavioral patterns. Developers who use emojis in their posts are significantly less likely to dropout from the online work platform. Surprisingly, solely using emoji usage as features, standard machine learning models can predict future dropouts of developers at a satisfactory accuracy. Features related to the general use and the emotions of emojis appear to be important factors, while they do not rule out paths through other purposes of emoji use.
    Critical Initialization of Wide and Deep Neural Networks through Partial Jacobians: General Theory and Applications. (arXiv:2111.12143v3 [cs.LG] UPDATED)
    (2 min) Deep neural networks are notorious for defying theoretical treatment. However, when the number of parameters in each layer tends to infinity the network function is a Gaussian process (GP) and quantitatively predictive description is possible. Gaussian approximation allows to formulate criteria for selecting hyperparameters, such as variances of weights and biases, as well as the learning rate. These criteria rely on the notion of criticality defined for deep neural networks. In this work we describe a new practical way to diagnose criticality. We introduce \emph{partial Jacobians} of a network, defined as derivatives of preactivations in layer $l$ with respect to preactivations in layer $l_0\leq l$. We derive recurrence relations for the norms of partial Jacobians and utilize these relations to analyze criticality of deep fully connected neural networks with LayerNorm and/or residual connections. We derive and implement a simple and cheap numerical test that allows to select optimal initialization for a broad class of deep neural networks. Using these tools we show quantitatively that proper stacking of the LayerNorm (applied to preactivations) and residual connections leads to an architecture that is critical for any initialization. Finally, we apply our methods to analyze the MLP-Mixer architecture and show that it is everywhere critical.
    First-Order Regret in Reinforcement Learning with Linear Function Approximation: A Robust Estimation Approach. (arXiv:2112.03432v2 [cs.LG] UPDATED)
    (2 min) Obtaining first-order regret bounds -- regret bounds scaling not as the worst-case but with some measure of the performance of the optimal policy on a given instance -- is a core question in sequential decision-making. While such bounds exist in many settings, they have proven elusive in reinforcement learning with large state spaces. In this work we address this gap, and show that it is possible to obtain regret scaling as $\mathcal{O}(\sqrt{V_1^\star K})$ in reinforcement learning with large state spaces, namely the linear MDP setting. Here $V_1^\star$ is the value of the optimal policy and $K$ is the number of episodes. We demonstrate that existing techniques based on least squares estimation are insufficient to obtain this result, and instead develop a novel robust self-normalized concentration bound based on the robust Catoni mean estimator, which may be of independent interest.
    Improved Complexities for Stochastic Conditional Gradient Methods under Interpolation-like Conditions. (arXiv:2006.08167v2 [math.OC] UPDATED)
    (2 min) We analyze stochastic conditional gradient methods for constrained optimization problems arising in over-parametrized machine learning. We show that one could leverage the interpolation-like conditions satisfied by such models to obtain improved oracle complexities. Specifically, when the objective function is convex, we show that the conditional gradient method requires $\mathcal{O}(\epsilon^{-2})$ calls to the stochastic gradient oracle to find an $\epsilon$-optimal solution. Furthermore, by including a gradient sliding step, we show that the number of calls reduces to $\mathcal{O}(\epsilon^{-1.5})$.
    CSI Clustering with Variational Autoencoding. (arXiv:2111.09758v2 [eess.SP] UPDATED)
    (2 min) The model order of a wireless channel plays an important role for a variety of applications in communications engineering, e.g., it represents the number of resolvable incident wavefronts with non-negligible power incident from a transmitter to a receiver. Areas such as direction of arrival estimation leverage the model order to analyze the multipath components of channel state information. In this work, we propose to use a variational autoencoder to group unlabeled channel state information with respect to the model order in the variational autoencoder latent space in an unsupervised manner. We validate our approach with simulated 3GPP channel data. Our results suggest that, in order to learn an appropriate clustering, it is crucial to use a more flexible likelihood model for the variational autoencoder decoder than it is usually the case in standard applications.
    Multi-View Self-Attention Based Transformer for Speaker Recognition. (arXiv:2110.05036v2 [eess.AS] UPDATED)
    (2 min) Initially developed for natural language processing (NLP), Transformer model is now widely used for speech processing tasks such as speaker recognition, due to its powerful sequence modeling capabilities. However, conventional self-attention mechanisms are originally designed for modeling textual sequence without considering the characteristics of speech and speaker modeling. Besides, different Transformer variants for speaker recognition have not been well studied. In this work, we propose a novel multi-view self-attention mechanism and present an empirical study of different Transformer variants with or without the proposed attention mechanism for speaker recognition. Specifically, to balance the capabilities of capturing global dependencies and modeling the locality, we propose a multi-view self-attention mechanism for speaker Transformer, in which different attention heads can attend to different ranges of the receptive field. Furthermore, we introduce and compare five Transformer variants with different network architectures, embedding locations, and pooling methods to learn speaker embeddings. Experimental results on the VoxCeleb1 and VoxCeleb2 datasets show that the proposed multi-view self-attention mechanism achieves improvement in the performance of speaker recognition, and the proposed speaker Transformer network attains excellent results compared with state-of-the-art models.
    Unsupervised Change Detection using DRE-CUSUM. (arXiv:2201.11678v1 [cs.LG])
    (2 min) This paper presents DRE-CUSUM, an unsupervised density-ratio estimation (DRE) based approach to determine statistical changes in time-series data when no knowledge of the pre-and post-change distributions are available. The core idea behind the proposed approach is to split the time-series at an arbitrary point and estimate the ratio of densities of distribution (using a parametric model such as a neural network) before and after the split point. The DRE-CUSUM change detection statistic is then derived from the cumulative sum (CUSUM) of the logarithm of the estimated density ratio. We present a theoretical justification as well as accuracy guarantees which show that the proposed statistic can reliably detect statistical changes, irrespective of the split point. While there have been prior works on using density ratio based methods for change detection, to the best of our knowledge, this is the first unsupervised change detection approach with a theoretical justification and accuracy guarantees. The simplicity of the proposed framework makes it readily applicable in various practical settings (including high-dimensional time-series data); we also discuss generalizations for online change detection. We experimentally show the superiority of DRE-CUSUM using both synthetic and real-world datasets over existing state-of-the-art unsupervised algorithms (such as Bayesian online change detection, its variants as well as several other heuristic methods).
    Graph Attention Networks for Channel Estimation in RIS-assisted Satellite IoT Communications. (arXiv:2104.00735v2 [cs.NI] UPDATED)
    (2 min) Direct-to-satellite (DtS) communication has gained importance recently to support globally connected Internet of things (IoT) networks. However, relatively long distances of densely deployed satellite networks around the Earth cause a high path loss. In addition, since high complexity operations such as beamforming, tracking and equalization have to be performed in IoT devices partially, both the hardware complexity and the need for high-capacity batteries of IoT devices increase. The reconfigurable intelligent surfaces (RISs) have the potential to increase the energy-efficiency and to perform complex signal processing over the transmission environment instead of IoT devices. But, RISs need the information of the cascaded channel in order to change the phase of the incident signal. This study proposes graph attention networks (GATs) for the challenging channel estimation problem and examines the performance of DtS IoT networks for different RIS configurations under GAT channel estimation. It is shown that the proposed GAT both demonstrates a higher performance with increased robustness under changing conditions and has lower computational complexity compared to conventional deep learning methods. Moreover, bit error rate performance is investigated for RIS designs with discrete and non-uniform phase shifts under channel estimation based on the proposed method. One of the findings in this study is that the channel models of the operating environment and the performance of the channel estimation method must be considered during RIS design to exploit performance improvement as far as possible.
    Distributed Proximal Splitting Algorithms with Rates and Acceleration. (arXiv:2010.00952v3 [math.OC] UPDATED)
    (2 min) We analyze several generic proximal splitting algorithms well suited for large-scale convex nonsmooth optimization. We derive sublinear and linear convergence results with new rates on the function value suboptimality or distance to the solution, as well as new accelerated versions, using varying stepsizes. In addition, we propose distributed variants of these algorithms, which can be accelerated as well. While most existing results are ergodic, our nonergodic results significantly broaden our understanding of primal-dual optimization algorithms.
    FOD-A: A Dataset for Foreign Object Debris in Airports. (arXiv:2110.03072v2 [cs.CV] UPDATED)
    (2 min) Foreign Object Debris (FOD) detection has attracted increased attention in the area of machine learning and computer vision. However, a robust and publicly available image dataset for FOD has not been initialized. To this end, this paper introduces an image dataset of FOD, named FOD in Airports (FOD-A). FOD-A object categories have been selected based on guidance from prior documentation and related research by the Federal Aviation Administration (FAA). In addition to the primary annotations of bounding boxes for object detection, FOD-A provides labeled environmental conditions. As such, each annotation instance is further categorized into three light level categories (bright, dim, and dark) and two weather categories (dry and wet). Currently, FOD-A has released 31 object categories and over 30,000 annotation instances. This paper presents the creation methodology, discusses the publicly available dataset extension process, and demonstrates the practicality of FOD-A with widely used machine learning models for object detection.
    MALTS: Matching After Learning to Stretch. (arXiv:1811.07415v6 [stat.ME] CROSS LISTED)
    (2 min) We introduce a flexible framework that produces high-quality almost-exact matches for causal inference. Most prior work in matching uses ad-hoc distance metrics, often leading to poor quality matches, particularly when there are irrelevant covariates. In this work, we learn an interpretable distance metric for matching, which leads to substantially higher quality matches. The learned distance metric stretches the covariate space according to each covariate's contribution to outcome prediction: this stretching means that mismatches on important covariates carry a larger penalty than mismatches on irrelevant covariates. Our ability to learn flexible distance metrics leads to matches that are interpretable and useful for the estimation of conditional average treatment effects.
    HYPERLOCK: In-Memory Hyperdimensional Encryption in Memristor Crossbar Array. (arXiv:2201.11362v1 [cs.CR])
    (2 min) We present a novel cryptography architecture based on memristor crossbar array, binary hypervectors, and neural network. Utilizing the stochastic and unclonable nature of memristor crossbar and error tolerance of binary hypervectors and neural network, implementation of the algorithm on memristor crossbar simulation is made possible. We demonstrate that with an increasing dimension of the binary hypervectors, the non-idealities in the memristor circuit can be effectively controlled. At the fine level of controlled crossbar non-ideality, noise from memristor circuit can be used to encrypt data while being sufficiently interpretable by neural network for decryption. We applied our algorithm on image cryptography for proof of concept, and to text en/decryption with 100% decryption accuracy despite crossbar noises. Our work shows the potential and feasibility of using memristor crossbars as an unclonable stochastic encoder unit of cryptography on top of their existing functionality as a vector-matrix multiplication acceleration device.
    Monitoring Model Deterioration with Explainable Uncertainty Estimation via Non-parametric Bootstrap. (arXiv:2201.11676v1 [cs.LG])
    (2 min) Monitoring machine learning models once they are deployed is challenging. It is even more challenging to decide when to retrain models in real-case scenarios when labeled data is beyond reach, and monitoring performance metrics becomes unfeasible. In this work, we use non-parametric bootstrapped uncertainty estimates and SHAP values to provide explainable uncertainty estimation as a technique that aims to monitor the deterioration of machine learning models in deployment environments, as well as determine the source of model deterioration when target labels are not available. Classical methods are purely aimed at detecting distribution shift, which can lead to false positives in the sense that the model has not deteriorated despite a shift in the data distribution. To estimate model uncertainty we construct prediction intervals using a novel bootstrap method, which improves upon the work of Kumar & Srivastava (2012). We show that both our model deterioration detection system as well as our uncertainty estimation method achieve better performance than the current state-of-the-art. Finally, we use explainable AI techniques to gain an understanding of the drivers of model deterioration. We release an open source Python package, doubt, which implements our proposed methods, as well as the code used to reproduce our experiments.
    OneFlow: Redesign the Distributed Deep Learning Framework from Scratch. (arXiv:2110.15032v3 [cs.DC] UPDATED)
    (2 min) Deep learning frameworks such as TensorFlow and PyTorch provide a productive interface for expressing and training a deep neural network (DNN) model on a single device or using data parallelism. Still, they may not be flexible or efficient enough in training emerging large models on distributed devices, which require more sophisticated parallelism beyond data parallelism. Plugins or wrappers have been developed to strengthen these frameworks for model or pipeline parallelism, but they complicate the usage and implementation of distributed deep learning. Aiming at a simple, neat redesign of distributed deep learning frameworks for various parallelism paradigms, we present OneFlow, a novel distributed training framework based on an SBP (split, broadcast and partial-value) abstraction and the actor model. SBP enables much easier programming of data parallelism and model parallelism than existing frameworks, and the actor model provides a succinct runtime mechanism to manage the complex dependencies imposed by resource constraints, data movement and computation in distributed deep learning. We demonstrate the general applicability and efficiency of OneFlow for training various large DNN models with case studies and extensive experiments. The results show that OneFlow outperforms many well-known customized libraries built on top of the state-of-the-art frameworks. The code of OneFlow is available at: https://github.com/Oneflow-Inc/oneflow.
    Greedy structure learning from data that contains systematic missing values. (arXiv:2107.04184v2 [cs.LG] UPDATED)
    (2 min) Learning from data that contain missing values represents a common phenomenon in many domains. Relatively few Bayesian Network structure learning algorithms account for missing data, and those that do tend to rely on standard approaches that assume missing data are missing at random, such as the Expectation-Maximisation algorithm. Because missing data are often systematic, there is a need for more pragmatic methods that can effectively deal with data sets containing missing values not missing at random. The absence of approaches that deal with systematic missing data impedes the application of BN structure learning methods to real-world problems where missingness are not random. This paper describes three variants of greedy search structure learning that utilise pairwise deletion and inverse probability weighting to maximally leverage the observed data and to limit potential bias caused by missing values. The first two of the variants can be viewed as sub-versions of the third and best performing variant, but are important in their own in illustrating the successive improvements in learning accuracy. The empirical investigations show that the proposed approach outperforms the commonly used and state-of-the-art Structural EM algorithm, both in terms of learning accuracy and efficiency, as well as both when data are missing at random and not at random.
    Generating Adjacency Matrix for Video Relocalization. (arXiv:2008.08977v2 [cs.CV] UPDATED)
    (2 min) In this paper, we continue our work on video relocalization task. Based on using graph convolution to extract intra-video and inter-video frame features, we improve the method by using similarity-metric based graph convolution, whose weighted adjacency matrix is achieved by calculating similarity metric between features of any two different time steps in the graph. Experiments on ActivityNet v1.2 and Thumos14 dataset show the effectiveness of this improvement, and it outperforms the state-of-the-art methods.
    Multi-Attribute Balanced Sampling for Disentangled GAN Controls. (arXiv:2111.00909v2 [cs.LG] CROSS LISTED)
    (2 min) Various controls over the generated data can be extracted from the latent space of a pre-trained GAN, as it implicitly encodes the semantics of the training data. The discovered controls allow to vary semantic attributes in the generated images but usually lead to entangled edits that affect multiple attributes at the same time. Supervised approaches typically sample and annotate a collection of latent codes, then train classifiers in the latent space to identify the controls. Since the data generated by GANs reflects the biases of the original dataset, so do the resulting semantic controls. We propose to address disentanglement by subsampling the generated data to remove over-represented co-occuring attributes thus balancing the semantics of the dataset before training the classifiers. We demonstrate the effectiveness of this approach by extracting disentangled linear directions for face manipulation on two popular GAN architectures, PGGAN and StyleGAN, and two datasets, CelebAHQ and FFHQ. We show that this approach outperforms state-of-the-art classifier-based methods while avoiding the need for disentanglement-enforcing post-processing.
    Exploration in Deep Reinforcement Learning: A Comprehensive Survey. (arXiv:2109.06668v3 [cs.AI] UPDATED)
    (3 min) Deep Reinforcement Learning (DRL) and Deep Multi-agent Reinforcement Learning (MARL) have achieved significant success across a wide range of domains, such as game AI, autonomous vehicles, robotics and finance. However, DRL and deep MARL agents are widely known to be sample-inefficient and millions of interactions are usually needed even for relatively simple game settings, thus preventing the wide application in real-industry scenarios. One bottleneck challenge behind is the well-known exploration problem, i.e., how to efficiently explore the unknown environments and collect informative experiences that could benefit the policy learning most. In this paper, we conduct a comprehensive survey on existing exploration methods in DRL and deep MARL for the purpose of providing understandings and insights on the critical problems and solutions. We first identify several key challenges to achieve efficient exploration, which most of the exploration methods aim at addressing. Then we provide a systematic survey of existing approaches by classifying them into two major categories: uncertainty-oriented exploration and intrinsic motivation-oriented exploration. The essence of uncertainty-oriented exploration is to leverage the quantification of the epistemic and aleatoric uncertainty to derive efficient exploration. By contrast, intrinsic motivation-oriented exploration methods usually incorporate different reward agnostic information for intrinsic exploration guidance. Beyond the above two main branches, we also conclude other exploration methods which adopt sophisticated techniques but are difficult to be classified into the above two categories. In addition, we provide a comprehensive empirical comparison of exploration methods for DRL on a set of commonly used benchmarks. Finally, we summarize the open problems of exploration in DRL and deep MARL and point out a few future directions.
    Generalized Radiograph Representation Learning via Cross-supervision between Images and Free-text Radiology Reports. (arXiv:2111.03452v2 [eess.IV] UPDATED)
    (2 min) Pre-training lays the foundation for recent successes in radiograph analysis supported by deep learning. It learns transferable image representations by conducting large-scale fully-supervised or self-supervised learning on a source domain. However, supervised pre-training requires a complex and labor intensive two-stage human-assisted annotation process while self-supervised learning cannot compete with the supervised paradigm. To tackle these issues, we propose a cross-supervised methodology named REviewing FreE-text Reports for Supervision (REFERS), which acquires free supervision signals from original radiology reports accompanying the radiographs. The proposed approach employs a vision transformer and is designed to learn joint representations from multiple views within every patient study. REFERS outperforms its transfer learning and self-supervised learning counterparts on 4 well-known X-ray datasets under extremely limited supervision. Moreover, REFERS even surpasses methods based on a source domain of radiographs with human-assisted structured labels. Thus REFERS has the potential to replace canonical pre-training methodologies.
    A Quantum-like Model for Predicting Human Decisions in the Entangled Social Systems. (arXiv:2111.13902v2 [physics.soc-ph] UPDATED)
    (2 min) Human-centered systems of systems such as social networks, Internet of Things, or healthcare systems are growingly becoming major facets of modern life. Realistic models of human behavior in such systems play a significant role in their accurate modeling and prediction. Yet, human behavior under uncertainty often violates the predictions by the conventional probabilistic models. Recently, quantum-like decision theories have shown a considerable potential to explain the contradictions in human behavior by applying quantum probability. But providing a quantum-like decision theory that could predict, rather than describe the current, state of human behavior is still one of the unsolved challenges. The main novelty of our approach is introducing an entangled Bayesian network inspired by the entanglement concept in quantum information theory, in which each human is a part of the entire society. Accordingly, society's effect on the dynamic evolution of the decision-making process, which is less often considered in decision theories, is modeled by the entanglement measures. The proposed predictive entangled quantum-like Bayesian network (PEQBN) is evaluated on 22 experimental tasks. Results confirm that PEQBN provides more realistic predictions of human decisions under uncertainty, when compared with classical Bayesian networks and three recent quantum-like approaches.
    Data-Aware Device Scheduling for Federated Edge Learning. (arXiv:2102.09491v2 [cs.DC] UPDATED)
    (2 min) Federated Edge Learning (FEEL) involves the collaborative training of machine learning models among edge devices, with the orchestration of a server in a wireless edge network. Due to frequent model updates, FEEL needs to be adapted to the limited communication bandwidth, scarce energy of edge devices, and the statistical heterogeneity of edge devices' data distributions. Therefore, a careful scheduling of a subset of devices for training and uploading models is necessary. In contrast to previous work in FEEL where the data aspects are under-explored, we consider data properties at the heart of the proposed scheduling algorithm. To this end, we propose a new scheduling scheme for non-independent and-identically-distributed (non-IID) and unbalanced datasets in FEEL. As the data is the key component of the learning, we propose a new set of considerations for data characteristics in wireless scheduling algorithms in FEEL. In fact, the data collected by the devices depends on the local environment and usage pattern. Thus, the datasets vary in size and distributions among the devices. In the proposed algorithm, we consider both data and resource perspectives. In addition to minimizing the completion time of FEEL as well as the transmission energy of the participating devices, the algorithm prioritizes devices with rich and diverse datasets. We first define a general framework for the data-aware scheduling and the main axes and requirements for diversity evaluation. Then, we discuss diversity aspects and some exploitable techniques and metrics. Next, we formulate the problem and present our FEEL scheduling algorithm. Evaluations in different scenarios show that our proposed FEEL scheduling algorithm can help achieve high accuracy in few rounds with a reduced cost.
    Robust Augmentation for Multivariate Time Series Classification. (arXiv:2201.11739v1 [cs.LG])
    (2 min) Neural networks are capable of learning powerful representations of data, but they are susceptible to overfitting due to the number of parameters. This is particularly challenging in the domain of time series classification, where datasets may contain fewer than 100 training examples. In this paper, we show that the simple methods of cutout, cutmix, mixup, and window warp improve the robustness and overall performance in a statistically significant way for convolutional, recurrent, and self-attention based architectures for time series classification. We evaluate these methods on 26 datasets from the University of East Anglia Multivariate Time Series Classification (UEA MTSC) archive and analyze how these methods perform on different types of time series data.. We show that the InceptionTime network with augmentation improves accuracy by 1% to 45% in 18 different datasets compared to without augmentation. We also show that augmentation improves accuracy for recurrent and self attention based architectures.
    Temporal Prediction and Evaluation of Brassica Growth in the Field using Conditional Generative Adversarial Networks. (arXiv:2105.07789v2 [cs.CV] UPDATED)
    (2 min) Farmers frequently assess plant growth and performance as basis for making decisions when to take action in the field, such as fertilization, weed control, or harvesting. The prediction of plant growth is a major challenge, as it is affected by numerous and highly variable environmental factors. This paper proposes a novel monitoring approach that comprises high-throughput imaging sensor measurements and their automatic analysis to predict future plant growth. Our approach's core is a novel machine learning-based generative growth model based on conditional generative adversarial networks, which is able to predict the future appearance of individual plants. In experiments with RGB time-series images of laboratory-grown Arabidopsis thaliana and field-grown cauliflower plants, we show that our approach produces realistic, reliable, and reasonable images of future growth stages. The automatic interpretation of the generated images through neural network-based instance segmentation allows the derivation of various phenotypic traits that describe plant growth.
    Learning to Aggregate and Refine Noisy Labels for Visual Sentiment Analysis. (arXiv:2109.07509v2 [cs.CV] UPDATED)
    (2 min) Visual sentiment analysis has received increasing attention in recent years. However, the dataset's quality is a concern because the sentiment labels are crowd-sourcing, subjective, and prone to mistakes, and poses a severe threat to the data-driven models, especially the deep neural networks. The deep models would generalize poorly on the testing cases when trained to over-fit the training samples with noisy sentiment labels. Inspired by the recent progress on learning with noisy labels, we propose a robust learning method to perform robust visual sentiment analysis. Our method relies on external memory to aggregate and filters noisy labels during training. The memory is composed of the prototypes with corresponding labels, which can be updated online. The learned prototypes and their labels can be regarded as denoising features and labels for the local regions and can guide the training process to prevent the model from overfitting the noisy cases. We establish a benchmark for visual sentiment analysis with label noise using publicly available datasets. The experiment results of the proposed benchmark settings comprehensively show the effectiveness of our method.
    TrustAL: Trustworthy Active Learning using Knowledge Distillation. (arXiv:2201.11661v1 [cs.LG])
    (2 min) Active learning can be defined as iterations of data labeling, model training, and data acquisition, until sufficient labels are acquired. A traditional view of data acquisition is that, through iterations, knowledge from human labels and models is implicitly distilled to monotonically increase the accuracy and label consistency. Under this assumption, the most recently trained model is a good surrogate for the current labeled data, from which data acquisition is requested based on uncertainty/diversity. Our contribution is debunking this myth and proposing a new objective for distillation. First, we found example forgetting, which indicates the loss of knowledge learned across iterations. Second, for this reason, the last model is no longer the best teacher -- For mitigating such forgotten knowledge, we select one of its predecessor models as a teacher, by our proposed notion of "consistency". We show that this novel distillation is distinctive in the following three aspects; First, consistency ensures to avoid forgetting labels. Second, consistency improves both uncertainty/diversity of labeled data. Lastly, consistency redeems defective labels produced by human annotators.
    COMBO: Conservative Offline Model-Based Policy Optimization. (arXiv:2102.08363v2 [cs.LG] UPDATED)
    (2 min) Model-based algorithms, which learn a dynamics model from logged experience and perform some sort of pessimistic planning under the learned model, have emerged as a promising paradigm for offline reinforcement learning (offline RL). However, practical variants of such model-based algorithms rely on explicit uncertainty quantification for incorporating pessimism. Uncertainty estimation with complex models, such as deep neural networks, can be difficult and unreliable. We overcome this limitation by developing a new model-based offline RL algorithm, COMBO, that regularizes the value function on out-of-support state-action tuples generated via rollouts under the learned model. This results in a conservative estimate of the value function for out-of-support state-action tuples, without requiring explicit uncertainty estimation. We theoretically show that our method optimizes a lower bound on the true policy value, that this bound is tighter than that of prior methods, and our approach satisfies a policy improvement guarantee in the offline setting. Through experiments, we find that COMBO consistently performs as well or better as compared to prior offline model-free and model-based methods on widely studied offline RL benchmarks, including image-based tasks.
    MixMix: All You Need for Data-Free Compression Are Feature and Data Mixing. (arXiv:2011.09899v3 [cs.LG] UPDATED)
    (2 min) User data confidentiality protection is becoming a rising challenge in the present deep learning research. Without access to data, conventional data-driven model compression faces a higher risk of performance degradation. Recently, some works propose to generate images from a specific pretrained model to serve as training data. However, the inversion process only utilizes biased feature statistics stored in one model and is from low-dimension to high-dimension. As a consequence, it inevitably encounters the difficulties of generalizability and inexact inversion, which leads to unsatisfactory performance. To address these problems, we propose MixMix based on two simple yet effective techniques: (1) Feature Mixing: utilizes various models to construct a universal feature space for generalized inversion; (2) Data Mixing: mixes the synthesized images and labels to generate exact label information. We prove the effectiveness of MixMix from both theoretical and empirical perspectives. Extensive experiments show that MixMix outperforms existing methods on the mainstream compression tasks, including quantization, knowledge distillation, and pruning. Specifically, MixMix achieves up to 4% and 20% accuracy uplift on quantization and pruning, respectively, compared to existing data-free compression work.
    Learning Constrained Adaptive Differentiable Predictive Control Policies With Guarantees. (arXiv:2004.11184v6 [eess.SY] UPDATED)
    (2 min) We present differentiable predictive control (DPC), a method for learning constrained neural control policies for linear systems with probabilistic performance guarantees. We employ automatic differentiation to obtain direct policy gradients by backpropagating the model predictive control (MPC) loss function and constraints penalties through a differentiable closed-loop system dynamics model. We demonstrate that the proposed method can learn parametric constrained control policies to stabilize systems with unstable dynamics, track time-varying references, and satisfy nonlinear state and input constraints. In contrast with imitation learning-based approaches, our method does not depend on a supervisory controller. Most importantly, we demonstrate that, without losing performance, our method is scalable and computationally more efficient than implicit, explicit, and approximate MPC. Under review at IEEE Transactions on Automatic Control.
    MQTransformer: Multi-Horizon Forecasts with Context Dependent and Feedback-Aware Attention. (arXiv:2009.14799v4 [cs.LG] UPDATED)
    (2 min) Recent advances in neural forecasting have produced major improvements in accuracy for probabilistic demand prediction. In this work, we propose novel improvements to the current state of the art by incorporating changes inspired by recent advances in Transformer architectures for Natural Language Processing. We develop a novel decoder-encoder attention for context-alignment, improving forecasting accuracy by allowing the network to study its own history based on the context for which it is producing a forecast. We also present a novel positional encoding that allows the neural network to learn context-dependent seasonality functions as well as arbitrary holiday distances. Finally we show that the current state of the art MQ-Forecaster (Wen et al., 2017) models display excess variability by failing to leverage previous errors in the forecast to improve accuracy. We propose a novel decoder-self attention scheme for forecasting that produces significant improvements in the excess variation of the forecast.
    Fast Moving Natural Evolution Strategy for High-Dimensional Problems. (arXiv:2201.11422v1 [cs.NE])
    (2 min) In this work, we propose a new variant of natural evolution strategies (NES) for high-dimensional black-box optimization problems. The proposed method, CR-FM-NES, extends a recently proposed state-of-the-art NES, Fast Moving Natural Evolution Strategy (FM-NES), in order to be applicable in high-dimensional problems. CR-FM-NES builds on an idea using a restricted representation of a covariance matrix instead of using a full covariance matrix, while inheriting an efficiency of FM-NES. The restricted representation of the covariance matrix enables CR-FM-NES to update parameters of a multivariate normal distribution in linear time and space complexity, which can be applied to high-dimensional problems. Our experimental results reveal that CR-FM-NES does not lose the efficiency of FM-NES, and on the contrary, CR-FM-NES has achieved significant speedup compared to FM-NES on some benchmark problems. Furthermore, our numerical experiments using 200, 600, and 1000-dimensional benchmark problems demonstrate that CR-FM-NES is effective over scalable baseline methods, VD-CMA and Sep-CMA.
    Mutual Distillation of Confident Knowledge. (arXiv:2106.01489v2 [cs.LG] UPDATED)
    (2 min) Mutual knowledge distillation (MKD) improves a model by distilling knowledge from another model. However, \textit{not all knowledge is certain and correct}, especially under adverse conditions. For example, label noise usually leads to less reliable models due to undesired memorization \cite{zhang2017understanding,arpit2017closer}. Wrong knowledge misleads the learning rather than helps. This problem can be handled by two aspects: (i) improving the reliability of a model where the knowledge is from (i.e., knowledge source's reliability); (ii) selecting reliable knowledge for distillation. In the literature, making a model more reliable is widely studied while selective MKD receives little attention. Therefore, we focus on studying selective MKD. Concretely, a generic MKD framework, \underline{C}onfident knowledge selection followed by \underline{M}utual \underline{D}istillation (CMD), is designed. The key component of CMD is a generic knowledge selection formulation, making the selection threshold either static (CMD-S) or progressive (CMD-P). Additionally, CMD covers two special cases: zero-knowledge and all knowledge, leading to a unified MKD framework. Extensive experiments are present to demonstrate the effectiveness of CMD and thoroughly justify the design of CMD. For example, CMD-P obtains new state-of-the-art results in robustness against label noise.
    Generalized Matrix Factorization: efficient algorithms for fitting generalized linear latent variable models to large data arrays. (arXiv:2010.02469v3 [cs.LG] UPDATED)
    (2 min) Unmeasured or latent variables are often the cause of correlations between multivariate measurements, which are studied in a variety of fields such as psychology, ecology, and medicine. For Gaussian measurements, there are classical tools such as factor analysis or principal component analysis with a well-established theory and fast algorithms. Generalized Linear Latent Variable models (GLLVMs) generalize such factor models to non-Gaussian responses. However, current algorithms for estimating model parameters in GLLVMs require intensive computation and do not scale to large datasets with thousands of observational units or responses. In this article, we propose a new approach for fitting GLLVMs to high-dimensional datasets, based on approximating the model using penalized quasi-likelihood and then using a Newton method and Fisher scoring to learn the model parameters. Computationally, our method is noticeably faster and more stable, enabling GLLVM fits to much larger matrices than previously possible. We apply our method on a dataset of 48,000 observational units with over 2,000 observed species in each unit and find that most of the variability can be explained with a handful of factors. We publish an easy-to-use implementation of our proposed fitting algorithm.
    Time Limits in Reinforcement Learning. (arXiv:1712.00378v4 [cs.LG] UPDATED)
    (2 min) In reinforcement learning, it is common to let an agent interact for a fixed amount of time with its environment before resetting it and repeating the process in a series of episodes. The task that the agent has to learn can either be to maximize its performance over (i) that fixed period, or (ii) an indefinite period where time limits are only used during training to diversify experience. In this paper, we provide a formal account for how time limits could effectively be handled in each of the two cases and explain why not doing so can cause state aliasing and invalidation of experience replay, leading to suboptimal policies and training instability. In case (i), we argue that the terminations due to time limits are in fact part of the environment, and thus a notion of the remaining time should be included as part of the agent's input to avoid violation of the Markov property. In case (ii), the time limits are not part of the environment and are only used to facilitate learning. We argue that this insight should be incorporated by bootstrapping from the value of the state at the end of each partial episode. For both cases, we illustrate empirically the significance of our considerations in improving the performance and stability of existing reinforcement learning algorithms, showing state-of-the-art results on several control tasks.
    Likelihood Contribution based Multi-scale Architecture for Generative Flows. (arXiv:1908.01686v3 [cs.LG] UPDATED)
    (2 min) Deep generative modeling using flows has gained popularity owing to the tractable exact log-likelihood estimation with efficient training and synthesis process. However, flow models suffer from the challenge of having high dimensional latent space, the same in dimension as the input space. An effective solution to the above challenge as proposed by Dinh et al. (2016) is a multi-scale architecture, which is based on iterative early factorization of a part of the total dimensions at regular intervals. Prior works on generative flow models involving a multi-scale architecture perform the dimension factorization based on static masking. We propose a novel multi-scale architecture that performs data-dependent factorization to decide which dimensions should pass through more flow layers. To facilitate the same, we introduce a heuristic based on the contribution of each dimension to the total log-likelihood which encodes the importance of the dimensions. Our proposed heuristic is readily obtained as part of the flow training process, enabling the versatile implementation of our likelihood contribution based multi-scale architecture for generic flow models. We present such implementations for several state-of-the-art flow models and demonstrate improvements in log-likelihood score and sampling quality on standard image benchmarks. We also conduct ablation studies to compare the proposed method with other options for dimension factorization.
    Advanced Stationary and Non-Stationary Kernel Designs for Domain-Aware Gaussian Processes. (arXiv:2102.03432v3 [stat.ML] UPDATED)
    (2 min) Gaussian process regression is a widely-applied method for function approximation and uncertainty quantification. The technique has gained popularity recently in the machine learning community due to its robustness and interpretability. The mathematical methods we discuss in this paper are an extension of the Gaussian-process framework. We are proposing advanced kernel designs that only allow for functions with certain desirable characteristics to be elements of the reproducing kernel Hilbert space (RKHS) that underlies all kernel methods and serves as the sample space for Gaussian process regression. These desirable characteristics reflect the underlying physics; two obvious examples are symmetry and periodicity constraints. In addition, non-stationary kernel designs can be defined in the same framework to yield flexible multi-task Gaussian processes. We will show the impact of advanced kernel designs on Gaussian processes using several synthetic and two scientific data sets. The results show that including domain knowledge, communicated through advanced kernel designs, has a significant impact on the accuracy and relevance of the function approximation.
    SSLGuard: A Watermarking Scheme for Self-supervised Learning Pre-trained Encoders. (arXiv:2201.11692v1 [cs.CR])
    (2 min) Self-supervised learning is an emerging machine learning (ML) paradigm. Compared to supervised learning that leverages high-quality labeled datasets to achieve good performance, self-supervised learning relies on unlabeled datasets to pre-train powerful encoders which can then be treated as feature extractors for various downstream tasks. The huge amount of data and computational resources consumption makes the encoders themselves become a valuable intellectual property of the model owner. Recent research has shown that the ML model's copyright is threatened by model stealing attacks, which aims to train a surrogate model to mimic the behavior of a given model. We empirically show that pre-trained encoders are highly vulnerable to model stealing attacks. However, most of the current efforts of copyright protection algorithms such as fingerprinting and watermarking concentrate on classifiers. Meanwhile, the intrinsic challenges of pre-trained encoder's copyright protection remain largely unstudied. We fill the gap by proposing SSLGuard, the first watermarking algorithm for pre-trained encoders. Given a clean pre-trained encoder, SSLGuard embeds a watermark into it and outputs a watermarked version. The shadow training technique is also applied to preserve the watermark under potential model stealing attacks. Our extensive evaluation shows that SSLGuard is effective in watermark injection and verification, and is robust against model stealing and other watermark removal attacks such as pruning and finetuning.
    Alleviating the Transit Timing Variations bias in transit surveys. II. RIVERS: Twin resonant Earth-sized planets around Kepler-1972 recovered from Kepler's false positive. (arXiv:2201.11459v1 [astro-ph.EP])
    (2 min) Transit Timing Variations (TTVs) can provide useful information for systems observed by transit, by putting constraints on the masses and eccentricities of the observed planets, or even constrain the existence of non-transiting companions. However, TTVs can also prevent the detection of small planets in transit surveys, or bias the recovered planetary and transit parameters. Here we show that Kepler-1972 c, initially the "not transit-like" false positive KOI-3184.02, is an Earth-sized planet whose orbit is perturbed by Kepler-1972 b (initially KOI-3184.01). The pair is locked in a 3:2 Mean-motion resonance, each planet displaying TTVs of more than 6h hours of amplitude over the duration of the Kepler mission. The two planets have similar masses $m_b/m_c =0.956_{-0.051}^{+0.056}$ and radii $R_b=0.802_{-0.041}^{+0.042}R_{Earth}$, $R_c=0.868_{-0.050}^{+0.051}R_{Earth}$, and the whole system, including the inner candidate KOI-3184.03, appear to be coplanar. Despite the faintness of the signals (SNR of 1.35 for each transit of Kepler-1972 b and 1.10 for Kepler-1972 c), we recovered the transits of the planets using the RIVERS method, based on the recognition of the tracks of planets in river diagrams using machine learning, and a photo-dynamic fit of the lightcurve. Recovering the correct ephemerides of the planets is essential to have a complete picture of the observed planetary systems. In particular, we show that in Kepler-1972, not taking into account planet-planet interactions yields an error of $\sim 30\%$ on the radii of planets b and c, in addition to generating in-transit scatter, which leads to mistake KOI3184.02 for a false positive. Alleviating this bias is essential for an unbiased view of Kepler systems, some of the TESS stars, and the upcoming PLATO mission.
    Domain-Invariant Representation Learning from EEG with Private Encoders. (arXiv:2201.11613v1 [cs.LG])
    (2 min) Deep learning based electroencephalography (EEG) signal processing methods are known to suffer from poor test-time generalization due to the changes in data distribution. This becomes a more challenging problem when privacy-preserving representation learning is of interest such as in clinical settings. To that end, we propose a multi-source learning architecture where we extract domain-invariant representations from dataset-specific private encoders. Our model utilizes a maximum-mean-discrepancy (MMD) based domain alignment approach to impose domain-invariance for encoded representations, which outperforms state-of-the-art approaches in EEG-based emotion classification. Furthermore, representations learned in our pipeline preserve domain privacy as dataset-specific private encoding alleviates the need for conventional, centralized EEG-based deep neural network training approaches with shared parameters.
    Implicit Regularization in Hierarchical Tensor Factorization and Deep Convolutional Neural Networks. (arXiv:2201.11729v1 [cs.LG])
    (2 min) In the pursuit of explaining implicit regularization in deep learning, prominent focus was given to matrix and tensor factorizations, which correspond to simplified neural networks. It was shown that these models exhibit implicit regularization towards low matrix and tensor ranks, respectively. Drawing closer to practical deep learning, the current paper theoretically analyzes the implicit regularization in hierarchical tensor factorization, a model equivalent to certain deep convolutional neural networks. Through a dynamical systems lens, we overcome challenges associated with hierarchy, and establish implicit regularization towards low hierarchical tensor rank. This translates to an implicit regularization towards locality for the associated convolutional networks. Inspired by our theory, we design explicit regularization discouraging locality, and demonstrate its ability to improve performance of modern convolutional networks on non-local tasks, in defiance of conventional wisdom by which architectural changes are needed. Our work highlights the potential of enhancing neural networks via theoretical analysis of their implicit regularization.
    As easy as APC: overcoming missing data and class imbalance in time series with self-supervised learning. (arXiv:2106.15577v5 [cs.LG] UPDATED)
    (2 min) High levels of missing data and strong class imbalance are ubiquitous challenges that are often presented simultaneously in real-world time series data. Existing methods approach these problems separately, frequently making significant assumptions about the underlying data generation process in order to lessen the impact of missing information. In this work, we instead demonstrate how a general self-supervised training method, namely Autoregressive Predictive Coding (APC), can be leveraged to overcome both missing data and class imbalance simultaneously without strong assumptions. Specifically, on a synthetic dataset, we show that standard baselines are substantially improved upon through the use of APC, yielding the greatest gains in the combined setting of high missingness and severe class imbalance. We further apply APC on two real-world medical time-series datasets, and show that APC improves the classification performance in all settings, ultimately achieving state-of-the-art AUPRC results on the Physionet benchmark.
    Learning Non-linear Wavelet Transformation via Normalizing Flow. (arXiv:2101.11306v2 [cs.LG] UPDATED)
    (2 min) Wavelet transformation stands as a cornerstone in modern data analysis and signal processing. Its mathematical essence is an invertible transformation that discerns slow patterns from fast ones in the frequency domain. Such an invertible transformation can be learned by a designed normalizing flow model. With a generalized lifting scheme as coupling layers, a factor-out layer resembling the downsampling, and parameter sharing at different levels of the model, one can train the normalizing flow to filter high-frequency elements at different levels, thus extending traditional linear wavelet transformations to learnable non-linear deep learning models. In this paper, a way of building such flow is proposed, along with a numerical analysis of the learned transformation. Then, we demonstrate the model's ability in image lossless compression, show it can achieve SOTA compression scores while achieving a small model size, substantial generalization ability, and the ability to handle high-dimensional data.
    To what extent should we trust AI models when they extrapolate?. (arXiv:2201.11260v1 [cs.LG])
    (2 min) Many applications affecting human lives rely on models that have come to be known under the umbrella of machine learning and artificial intelligence. These AI models are usually complicated mathematical functions that map from an input space to an output space. Stakeholders are interested to know the rationales behind models' decisions and functional behavior. We study this functional behavior in relation to the data used to create the models. On this topic, scholars have often assumed that models do not extrapolate, i.e., they learn from their training samples and process new input by interpolation. This assumption is questionable: we show that models extrapolate frequently; the extent of extrapolation varies and can be socially consequential. We demonstrate that extrapolation happens for a substantial portion of datasets more than one would consider reasonable. How can we trust models if we do not know whether they are extrapolating? Given a model trained to recommend clinical procedures for patients, can we trust the recommendation when the model considers a patient older or younger than all the samples in the training set? If the training set is mostly Whites, to what extent can we trust its recommendations about Black and Hispanic patients? Which dimension (race, gender, or age) does extrapolation happen? Even if a model is trained on people of all races, it still may extrapolate in significant ways related to race. The leading question is, to what extent can we trust AI models when they process inputs that fall outside their training set? This paper investigates several social applications of AI, showing how models extrapolate without notice. We also look at different sub-spaces of extrapolation for specific individuals subject to AI models and report how these extrapolations can be interpreted, not mathematically, but from a humanistic point of view.
    A Systematic Study of Bias Amplification. (arXiv:2201.11706v1 [cs.LG])
    (2 min) Recent research suggests that predictions made by machine-learning models can amplify biases present in the training data. When a model amplifies bias, it makes certain predictions at a higher rate for some groups than expected based on training-data statistics. Mitigating such bias amplification requires a deep understanding of the mechanics in modern machine learning that give rise to that amplification. We perform the first systematic, controlled study into when and how bias amplification occurs. To enable this study, we design a simple image-classification problem in which we can tightly control (synthetic) biases. Our study of this problem reveals that the strength of bias amplification is correlated to measures such as model accuracy, model capacity, model overconfidence, and amount of training data. We also find that bias amplification can vary greatly during training. Finally, we find that bias amplification may depend on the difficulty of the classification task relative to the difficulty of recognizing group membership: bias amplification appears to occur primarily when it is easier to recognize group membership than class membership. Our results suggest best practices for training machine-learning models that we hope will help pave the way for the development of better mitigation strategies.
    Scalable Learning for Optimal Load Shedding Under Power Grid Emergency Operations. (arXiv:2111.11980v2 [cs.LG] UPDATED)
    (2 min) Effective and timely responses to unexpected contingencies are crucial for enhancing the resilience of power grids. Given the fast, complex process of cascading propagation, corrective actions such as optimal load shedding (OLS) are difficult to attain in large-scale networks due to the computation complexity and communication latency issues. This work puts forth an innovative learning-for-OLS approach by constructing the optimal decision rules of load shedding under a variety of potential contingency scenarios through offline neural network (NN) training. Notably, the proposed NN-based OLS decisions are fully decentralized, enabling individual load centers to quickly react to the specific contingency using readily available local measurements. Numerical studies on the IEEE 14-bus system have demonstrated the effectiveness of our scalable OLS design for real-time responses to severe grid emergency events.
    Exploring the Geometry and Topology of Neural Network Loss Landscapes. (arXiv:2102.00485v2 [cs.LG] UPDATED)
    (2 min) Recent work has established clear links between the generalization performance of trained neural networks and the geometry of their loss landscape near the local minima to which they converge. This suggests that qualitative and quantitative examination of the loss landscape geometry could yield insights about neural network generalization performance during training. To this end, researchers have proposed visualizing the loss landscape through the use of simple dimensionality reduction techniques. However, such visualization methods have been limited by their linear nature and only capture features in one or two dimensions, thus restricting sampling of the loss landscape to lines or planes. Here, we expand and improve upon these in three ways. First, we present a novel "jump and retrain" procedure for sampling relevant portions of the loss landscape. We show that the resulting sampled data holds more meaningful information about the network's ability to generalize. Next, we show that non-linear dimensionality reduction of the jump and retrain trajectories via PHATE, a trajectory and manifold-preserving method, allows us to visualize differences between networks that are generalizing well vs poorly. Finally, we combine PHATE trajectories with a computational homology characterization to quantify trajectory differences.
    Scalable Spike Source Localization in Extracellular Recordings using Amortized Variational Inference. (arXiv:1905.12375v4 [q-bio.NC] UPDATED)
    (2 min) Determining the positions of neurons in an extracellular recording is useful for investigating functional properties of the underlying neural circuitry. In this work, we present a Bayesian modelling approach for localizing the source of individual spikes on high-density, microelectrode arrays. To allow for scalable inference, we implement our model as a variational autoencoder and perform amortized variational inference. We evaluate our method on both biophysically realistic simulated and real extracellular datasets, demonstrating that it is more accurate than and can improve spike sorting performance over heuristic localization methods such as center of mass.
    Optimal transport weights for causal inference. (arXiv:2109.01991v3 [stat.ME] UPDATED)
    (2 min) Weighting methods are a common tool to de-bias estimates of causal effects. And though there are an increasing number of seemingly disparate methods, many of them can be folded into one unifying regime: Causal Optimal Transport. This new method directly targets distributional balance by minimizing optimal transport distances between treatment and control groups or, more generally, between a source and target population. Our approach is semiparametrically efficient and model-free but can also incorporate moments or any other important functions of covariates that the researcher desires to balance. We find that Causal Optimal Transport outperforms competitor methods when both the propensity score and outcome models are misspecified, indicating it is a robust alternative to common weighting methods. Finally, we demonstrate the utility of our method in an external control study examining the effect of misoprostol versus oxytocin for the treatment of post-partum hemorrhage.
    An Analysis on Ensemble Learning optimized Medical Image Classification with Deep Convolutional Neural Networks. (arXiv:2201.11440v1 [cs.CV])
    (2 min) Novel and high-performance medical image classification pipelines are heavily utilizing ensemble learning strategies. The idea of ensemble learning is to assemble diverse models or multiple predictions and, thus, boost prediction performance. However, it is still an open question to what extent as well as which ensemble learning strategies are beneficial in deep learning based medical image classification pipelines. In this work, we proposed a reproducible medical image classification pipeline for analyzing the performance impact of the following ensemble learning techniques: Augmenting, Stacking, and Bagging. The pipeline consists of state-of-the-art preprocessing and image augmentation methods as well as 9 deep convolution neural network architectures. It was applied on four popular medical imaging datasets with varying complexity. Furthermore, 12 pooling functions for combining multiple predictions were analyzed, ranging from simple statistical functions like unweighted averaging up to more complex learning-based functions like support vector machines. Our results revealed that Stacking achieved the largest performance gain of up to 13% F1-score increase. Augmenting showed consistent improvement capabilities by up to 4% and is also applicable to single model based pipelines. Cross-validation based Bagging demonstrated to be the most complex ensemble learning method, which resulted in an F1-score decrease in all analyzed datasets (up to -10%). Furthermore, we demonstrated that simple statistical pooling functions are equal or often even better than more complex pooling functions. We concluded that the integration of Stacking and Augmentation ensemble learning techniques is a powerful method for any medical image classification pipeline to improve robustness and boost performance.
    Implicit Regularization of Bregman Proximal Point Algorithm and Mirror Descent on Separable Data. (arXiv:2108.06808v3 [cs.LG] UPDATED)
    (2 min) Bregman proximal point algorithm (BPPA), as one of the centerpieces in the optimization toolbox, has been witnessing emerging applications. With simple and easy to implement update rule, the algorithm bears several compelling intuitions for empirical successes, yet rigorous justifications are still largely unexplored. We study the computational properties of BPPA through classification tasks with separable data, and demonstrate provable algorithmic regularization effects associated with BPPA. We show that BPPA attains non-trivial margin, which closely depends on the condition number of the distance generating function inducing the Bregman divergence. We further demonstrate that the dependence on the condition number is tight for a class of problems, thus showing the importance of divergence in affecting the quality of the obtained solutions. In addition, we extend our findings to mirror descent (MD), for which we establish similar connections between the margin and Bregman divergence. We demonstrate through a concrete example, and show BPPA/MD converges in direction to the maximal margin solution with respect to the Mahalanobis distance. Our theoretical findings are among the first to demonstrate the benign learning properties BPPA/MD, and also provide corroborations for a careful choice of divergence in the algorithmic design.
    Confidence May Cheat: Self-Training on Graph Neural Networks under Distribution Shift. (arXiv:2201.11349v1 [cs.LG])
    (2 min) Graph Convolutional Networks (GCNs) have recently attracted vast interest and achieved state-of-the-art performance on graphs, but its success could typically hinge on careful training with amounts of expensive and time-consuming labeled data. To alleviate labeled data scarcity, self-training methods have been widely adopted on graphs by labeling high-confidence unlabeled nodes and then adding them to the training step. In this line, we empirically make a thorough study for current self-training methods on graphs. Surprisingly, we find that high-confidence unlabeled nodes are not always useful, and even introduce the distribution shift issue between the original labeled dataset and the augmented dataset by self-training, severely hindering the capability of self-training on graphs. To this end, in this paper, we propose a novel Distribution Recovered Graph Self-Training framework (DR-GST), which could recover the distribution of the original labeled dataset. Specifically, we first prove the equality of loss function in self-training framework under the distribution shift case and the population distribution if each pseudo-labeled node is weighted by a proper coefficient. Considering the intractability of the coefficient, we then propose to replace the coefficient with the information gain after observing the same changing trend between them, where information gain is respectively estimated via both dropout variational inference and dropedge variational inference in DR-GST. However, such a weighted loss function will enlarge the impact of incorrect pseudo labels. As a result, we apply the loss correction method to improve the quality of pseudo labels. Both our theoretical analysis and extensive experiments on five benchmark datasets demonstrate the effectiveness of the proposed DR-GST, as well as each well-designed component in DR-GST.
    Can we Generalize and Distribute Private Representation Learning?. (arXiv:2010.01792v4 [cs.LG] UPDATED)
    (2 min) We study the problem of learning representations that are private yet informative i.e., provide information about intended "ally" targets while hiding sensitive "adversary" attributes). We propose Exclusion-Inclusion Generative Adversarial Network (EIGAN), a generalized private representation learning (PRL) architecture that accounts for multiple ally and adversary attributes unlike existing PRL solutions. While centrally-aggregated dataset is a prerequisite for most PRL techniques, data in real-world is often siloed across multiple distributed nodes unwilling to share the raw data because of privacy concerns. We address this practical constraint by developing D-EIGAN, the first distributed PRL method that learns representations at each node without transmitting the source data. We theoretically analyze the behavior of adversaries under the optimal EIGAN and D-EIGAN encoders and the impact of dependencies among ally and adversary tasks on the optimization objective. Our experiments on various datasets demonstrate the advantages of EIGAN in terms of performance, robustness, and scalability. In particular, EIGAN outperforms the previous state-of-the-art by a significant accuracy margin (47% improvement), and D-EIGAN's performance is consistently on par with EIGAN under different network settings.
    LiteLSTM Architecture for Deep Recurrent Neural Networks. (arXiv:2201.11624v1 [cs.LG])
    (2 min) Long short-term memory (LSTM) is a robust recurrent neural network architecture for learning spatiotemporal sequential data. However, it requires significant computational power for learning and implementing from both software and hardware aspects. This paper proposes a novel LiteLSTM architecture based on reducing the computation components of the LSTM using the weights sharing concept to reduce the overall architecture cost and maintain the architecture performance. The proposed LiteLSTM can be significant for learning big data where time-consumption is crucial such as the security of IoT devices and medical data. Moreover, it helps to reduce the CO2 footprint. The proposed model was evaluated and tested empirically on two different datasets from computer vision and cybersecurity domains.
    From Motion to Muscle. (arXiv:2201.11501v1 [cs.LG])
    (2 min) Voluntary human motion is the product of muscle activity that results from upstream motion planning of the motor cortical areas. We show that muscle activity can be artificially generated based on motion features such as position, velocity, and acceleration. For this purpose, we specifically develop an approach based on recurrent neural network that is trained in a supervised learning session; additional neural network architectures are considered and evaluated. The performance is evaluated by a new score called the zero-line score. The latter adaptively rescales the loss function of the generated signal for all channels comparing the overall range of muscle activity and thus dynamically evaluates similarities between both signals. The model achieves remarkable precision for previously trained movements and maintains significantly high precision for new movements that have not been previously trained. Further, these models are trained on multiple subjects and thus are able to generalize across individuals. In addition, we distinguish between a general model that has been trained on several subjects, a subject-specific model, and a specific pre-trained model that uses the general model as a basis and is adapted to a specific subject afterward. The subject-specific generation of muscle activity can be further used to improve the rehabilitation of neuromuscular diseases with myoelectric prostheses and functional electric stimulation.
    Reinforced Cooperative Load Balancing in Data Center. (arXiv:2201.11727v1 [cs.DC])
    (2 min) Network load balancers are central components in modern data centers, that cooperatively distribute workloads of high arrival rates across application servers, thereby contribute to offering scalable services. The independent and "selfish" load balancing strategy is not necessarily the globally optimal one. This paper represents the load balancing problem as a cooperative team-game with limited observations over system states, and adopts multi-agent reinforcement learning methods to make fair load balancing decisions without inducing additional processing latency. On both a simulation and an emulation system, the proposed method is evaluated against other load balancing algorithms, including state-of-the-art heuristics and learning-based strategies. Experiments under different settings and complexities show the advantageous performance of the proposed method.
    Team Yao at Factify 2022: Utilizing Pre-trained Models and Co-attention Networks for Multi-Modal Fact Verification. (arXiv:2201.11664v1 [cs.CV])
    (2 min) In recent years, social media has enabled users to get exposed to a myriad of misinformation and disinformation; thus, misinformation has attracted a great deal of attention in research fields and as a social issue. To address the problem, we propose a framework, Pre-CoFact, composed of two pre-trained models for extracting features from text and images, and multiple co-attention networks for fusing the same modality but different sources and different modalities. Besides, we adopt the ensemble method by using different pre-trained models in Pre-CoFact to achieve better performance. We further illustrate the effectiveness from the ablation study and examine different pre-trained models for comparison. Our team, Yao, won the fifth prize (F1-score: 74.585\%) in the Factify challenge hosted by De-Factify @ AAAI 2022, which demonstrates that our model achieved competitive performance without using auxiliary tasks or extra information. The source code of our work is publicly available at https://github.com/wywyWang/Multi-Modal-Fact-Verification-2021
    MeltpoolNet: Melt pool Characteristic Prediction in Metal Additive Manufacturing Using Machine Learning. (arXiv:2201.11662v1 [cs.LG])
    (2 min) Characterizing meltpool shape and geometry is essential in metal Additive Manufacturing (MAM) to control the printing process and avoid defects. Predicting meltpool flaws based on process parameters and powder material is difficult due to the complex nature of MAM process. Machine learning (ML) techniques can be useful in connecting process parameters to the type of flaws in the meltpool. In this work, we introduced a comprehensive framework for benchmarking ML for melt pool characterization. An extensive experimental dataset has been collected from more than 80 MAM articles containing MAM processing conditions, materials, meltpool dimensions, meltpool modes and flaw types. We introduced physics-aware MAM featurization, versatile ML models, and evaluation metrics to create a comprehensive learning framework for meltpool defect and geometry prediction. This benchmark can serve as a basis for melt pool control and process optimization. In addition, data-driven explicit models have been identified to estimate meltpool geometry from process parameters and material properties which outperform Rosenthal estimation for meltpool geometry while maintaining interpretability.
    Model Agnostic Interpretability for Multiple Instance Learning. (arXiv:2201.11701v1 [cs.LG])
    (2 min) In Multiple Instance Learning (MIL), models are trained using bags of instances, where only a single label is provided for each bag. A bag label is often only determined by a handful of key instances within a bag, making it difficult to interpret what information a classifier is using to make decisions. In this work, we establish the key requirements for interpreting MIL models. We then go on to develop several model-agnostic approaches that meet these requirements. Our methods are compared against existing inherently interpretable MIL models on several datasets, and achieve an increase in interpretability accuracy of up to 30%. We also examine the ability of the methods to identify interactions between instances and scale to larger datasets, improving their applicability to real-world problems.
    Capture Agent Free Biosensing using Porous Silicon Arrays and Machine Learning. (arXiv:2201.11671v1 [physics.med-ph])
    (2 min) Biosensors are an essential tool for medical diagnostics, environmental monitoring and food safety. Typically, biosensors are designed to detect specific analytes through functionalization with the appropriate capture agents. However, the use of capture agents limits the number of analytes that can be simultaneously detected and reduces the robustness of the biosensor. In this work, we report a versatile, capture agent free biosensor platform based on an array of porous silicon (PSi) thin films, which has the potential to robustly detect a wide variety of analytes based on their physical and chemical properties in the nanoscale porous media. The ability of this system to reproducibly classify, quantify, and discriminate three proteins is demonstrated to concentrations down to at least 0.02g/L (between 300nM and 450nM) by utilizing PSi array elements with a unique combination of pore size and buffer pH, employing linear discriminant analysis for dimensionality reduction, and using support vector machines as a classifier. This approach represents a significant step towards a low cost, simple and robust biosensor platform that is able to detect a vast range of biomolecules.
    Dissecting the impact of different loss functions with gradient surgery. (arXiv:2201.11307v1 [cs.CV])
    (2 min) Pair-wise loss is an approach to metric learning that learns a semantic embedding by optimizing a loss function that encourages images from the same semantic class to be mapped closer than images from different classes. The literature reports a large and growing set of variations of the pair-wise loss strategies. Here we decompose the gradient of these loss functions into components that relate to how they push the relative feature positions of the anchor-positive and anchor-negative pairs. This decomposition allows the unification of a large collection of current pair-wise loss functions. Additionally, explicitly constructing pair-wise gradient updates to separate out these effects gives insights into which have the biggest impact, and leads to a simple algorithm that beats the state of the art for image retrieval on the CAR, CUB and Stanford Online products datasets.
    Monotonic Alpha-divergence Minimisation for Variational Inference. (arXiv:2103.05684v2 [stat.CO] UPDATED)
    (2 min) In this paper, we introduce a novel family of iterative algorithms which carry out $\alpha$-divergence minimisation in a Variational Inference context. They do so by ensuring a systematic decrease at each step in the $\alpha$-divergence between the variational and the posterior distributions. In its most general form, the variational distribution is a mixture model and our framework allows us to simultaneously optimise the weights and components parameters of this mixture model. Notably, our approach permits to build on various methods previously proposed for $\alpha$-divergence minimisation such as Gradient or Power Descent schemes and we also shed a new light on an integrated Expectation Maximization algorithm. Lastly, we provide empirical evidence that our methodology yields improved results on several multimodal target distributions.
    Bit-serial Weight Pools: Compression and Arbitrary Precision Execution of Neural Networks on Resource Constrained Processors. (arXiv:2201.11651v1 [cs.LG])
    (2 min) Applications of neural networks on edge systems have proliferated in recent years but the ever-increasing model size makes neural networks not able to deploy on resource-constrained microcontrollers efficiently. We propose bit-serial weight pools, an end-to-end framework that includes network compression and acceleration of arbitrary sub-byte precision. The framework can achieve up to 8x compression compared to 8-bit networks by sharing a pool of weights across the entire network. We further propose a bit-serial lookup based software implementation that allows runtime-bitwidth tradeoff and is able to achieve more than 2.8x speedup and 7.5x storage compression compared to 8-bit weight pool networks, with less than 1% accuracy drop.
    Domain generalization in deep learning-based mass detection in mammography: A large-scale multi-center study. (arXiv:2201.11620v1 [cs.CV])
    (2 min) Computer-aided detection systems based on deep learning have shown great potential in breast cancer detection. However, the lack of domain generalization of artificial neural networks is an important obstacle to their deployment in changing clinical environments. In this work, we explore the domain generalization of deep learning methods for mass detection in digital mammography and analyze in-depth the sources of domain shift in a large-scale multi-center setting. To this end, we compare the performance of eight state-of-the-art detection methods, including Transformer-based models, trained in a single domain and tested in five unseen domains. Moreover, a single-source mass detection training pipeline is designed to improve the domain generalization without requiring images from the new domain. The results show that our workflow generalizes better than state-of-the-art transfer learning-based approaches in four out of five domains while reducing the domain shift caused by the different acquisition protocols and scanner manufacturers. Subsequently, an extensive analysis is performed to identify the covariate shifts with bigger effects on the detection performance, such as due to differences in patient age, breast density, mass size, and mass malignancy. Ultimately, this comprehensive study provides key insights and best practices for future research on domain generalization in deep learning-based breast cancer detection.
    Simplicial Convolutional Filters. (arXiv:2201.11720v1 [eess.SP])
    (2 min) We study linear filters for processing signals supported on abstract topological spaces modeled as simplicial complexes, which may be interpreted as generalizations of graphs that account for nodes, edges, triangular faces etc. To process such signals, we develop simplicial convolutional filters defined as matrix polynomials of the lower and upper Hodge Laplacians. First, we study the properties of these filters and show that they are linear and shift-invariant, as well as permutation and orientation equivariant. These filters can also be implemented in a distributed fashion with a low computational complexity, as they involve only (multiple rounds of) simplicial shifting between upper and lower adjacent simplices. Second, focusing on edge-flows, we study the frequency responses of these filters and examine how we can use the Hodge-decomposition to delineate gradient, curl and harmonic frequencies. We discuss how these frequencies correspond to the lower- and the upper-adjacent couplings and the kernel of the Hodge Laplacian, respectively, and can be tuned independently by our filter designs. Third, we study different procedures for designing simplicial convolutional filters and discuss their relative advantages. Finally, we corroborate our simplicial filters in several applications: to extract different frequency components of a simplicial signal, to denoise edge flows, and to analyze financial markets and traffic networks.
    The Implicit Bias of Benign Overfitting. (arXiv:2201.11489v1 [cs.LG])
    (2 min) The phenomenon of benign overfitting, where a predictor perfectly fits noisy training data while attaining low expected loss, has received much attention in recent years, but still remains not fully understood beyond simple linear regression setups. In this paper, we show that for regression, benign overfitting is "biased" towards certain types of problems, in the sense that its existence on one learning problem excludes its existence on other learning problems. On the negative side, we use this to argue that one should not expect benign overfitting to occur in general, for several natural extensions of the plain linear regression problems studied so far. We then turn to classification problems, and show that the situation there is much more favorable. Specifically, we consider a model where an arbitrary input distribution of some fixed dimension $k$ is concatenated with a high-dimensional distribution, and prove that the max-margin predictor (to which gradient-based methods are known to converge in direction) is asymptotically biased towards minimizing the expected *squared hinge loss* w.r.t. the $k$-dimensional distribution. This allows us to reduce the question of benign overfitting in classification to the simpler question of whether this loss is a good surrogate for the prediction error, and use it to show benign overfitting in some new settings.
    Learning Mixtures of Linear Dynamical Systems. (arXiv:2201.11211v1 [stat.ML])
    (2 min) We study the problem of learning a mixture of multiple linear dynamical systems (LDSs) from unlabeled short sample trajectories, each generated by one of the LDS models. Despite the wide applicability of mixture models for time-series data, learning algorithms that come with end-to-end performance guarantees are largely absent from existing literature. There are multiple sources of technical challenges, including but not limited to (1) the presence of latent variables (i.e. the unknown labels of trajectories); (2) the possibility that the sample trajectories might have lengths much smaller than the dimension $d$ of the LDS models; and (3) the complicated temporal dependence inherent to time-series data. To tackle these challenges, we develop a two-stage meta-algorithm, which is guaranteed to efficiently recover each ground-truth LDS model up to error $\tilde{O}(\sqrt{d/T})$, where $T$ is the total sample size. We validate our theoretical studies with numerical experiments, confirming the efficacy of the proposed algorithm.
    Human Interpretation of Saliency-based Explanation Over Text. (arXiv:2201.11569v1 [cs.CL])
    (2 min) While a lot of research in explainable AI focuses on producing effective explanations, less work is devoted to the question of how people understand and interpret the explanation. In this work, we focus on this question through a study of saliency-based explanations over textual data. Feature-attribution explanations of text models aim to communicate which parts of the input text were more influential than others towards the model decision. Many current explanation methods, such as gradient-based or Shapley value-based methods, provide measures of importance which are well-understood mathematically. But how does a person receiving the explanation (the explainee) comprehend it? And does their understanding match what the explanation attempted to communicate? We empirically investigate the effect of various factors of the input, the feature-attribution explanation, and visualization procedure, on laypeople's interpretation of the explanation. We query crowdworkers for their interpretation on tasks in English and German, and fit a GAMM model to their responses considering the factors of interest. We find that people often mis-interpret the explanations: superficial and unrelated factors, such as word length, influence the explainees' importance assignment despite the explanation communicating importance directly. We then show that some of this distortion can be attenuated: we propose a method to adjust saliencies based on model estimates of over- and under-perception, and explore bar charts as an alternative to heatmap saliency visualization. We find that both approaches can attenuate the distorting effect of specific factors, leading to better-calibrated understanding of the explanation.
    Towards a Secure and Reliable Federated Learning using Blockchain. (arXiv:2201.11311v1 [cs.CR])
    (2 min) Federated learning (FL) is a distributed machine learning (ML) technique that enables collaborative training in which devices perform learning using a local dataset while preserving their privacy. This technique ensures privacy, communication efficiency, and resource conservation. Despite these advantages, FL still suffers from several challenges related to reliability (i.e., unreliable participating devices in training), tractability (i.e., a large number of trained models), and anonymity. To address these issues, we propose a secure and trustworthy blockchain framework (SRB-FL) tailored to FL, which uses blockchain features to enable collaborative model training in a fully distributed and trustworthy manner. In particular, we design a secure FL based on the blockchain sharding that ensures data reliability, scalability, and trustworthiness. In addition, we introduce an incentive mechanism to improve the reliability of FL devices using subjective multi-weight logic. The results show that our proposed SRB-FL framework is efficient and scalable, making it a promising and suitable solution for federated learning.
    On the Role of Multi-Objective Optimization to the Transit Network Design Problem. (arXiv:2201.11616v1 [cs.LG])
    (2 min) Ongoing traffic changes, including those triggered by the COVID-19 pandemic, reveal the necessity to adapt our public transport systems to the ever-changing users' needs. This work shows that single and multi objective stances can be synergistically combined to better answer the transit network design problem (TNDP). Single objective formulations are dynamically inferred from the rating of networks in the approximated (multi-objective) Pareto Front, where a regression approach is used to infer the optimal weights of transfer needs, times, distances, coverage, and costs. As a guiding case study, the solution is applied to the multimodal public transport network in the city of Lisbon, Portugal. The system takes individual trip data given by smartcard validations at CARRIS buses and METRO subway stations and uses them to estimate the origin-destination demand in the city. Then, Genetic Algorithms are used, considering both single and multi objective approaches, to redesign the bus network that better fits the observed traffic demand. The proposed TNDP optimization proved to improve results, with reductions in objective functions of up to 28.3%. The system managed to extensively reduce the number of routes, and all passenger related objectives, including travel time and transfers per trip, significantly improve. Grounded on automated fare collection data, the system can incrementally redesign the bus network to dynamically handle ongoing changes to the city traffic.
    Benchmarking learned non-Cartesian k-space trajectories and reconstruction networks. (arXiv:2201.11356v1 [cs.LG])
    (2 min) We benchmark the current existing methods to jointly learn non-Cartesian k-space trajectory and reconstruction: PILOT, BJORK, and compare them with those obtained from the recently developed generalized hybrid learning (HybLearn) framework. We present the advantages of using projected gradient descent to enforce MR scanner hardware constraints as compared to using added penalties in the cost function. Further, we use the novel HybLearn scheme to jointly learn and compare our results through a retrospective study on fastMRI validation dataset.
    Stock2Vec: An Embedding to Improve Predictive Models for Companies. (arXiv:2201.11290v1 [cs.LG])
    (2 min) Building predictive models for companies often relies on inference using historical data of companies in the same industry sector. However, companies are similar across a variety of dimensions that should be leveraged in relevant prediction problems. This is particularly true for large, complex organizations which may not be well defined by a single industry and have no clear peers. To enable prediction using company information across a variety of dimensions, we create an embedding of company stocks, Stock2Vec, which can be easily added to any prediction model that applies to companies with associated stock prices. We describe the process of creating this rich vector representation from stock price fluctuations, and characterize what the dimensions represent. We then conduct comprehensive experiments to evaluate this embedding in applied machine learning problems in various business contexts. Our experiment results demonstrate that the four features in the Stock2Vec embedding can readily augment existing cross-company models and enhance cross-company predictions.
    Transformer Module Networks for Systematic Generalization in Visual Question Answering. (arXiv:2201.11316v1 [cs.CV])
    (2 min) Transformer-based models achieve great performance on Visual Question Answering (VQA). However, when we evaluate them on systematic generalization, i.e., handling novel combinations of known concepts, their performance degrades. Neural Module Networks (NMNs) are a promising approach for systematic generalization that consists on composing modules, i.e., neural networks that tackle a sub-task. Inspired by Transformers and NMNs, we propose Transformer Module Network (TMN), a novel Transformer-based model for VQA that dynamically composes modules into a question-specific Transformer network. TMNs achieve state-of-the-art systematic generalization performance in three VQA datasets, namely, CLEVR-CoGenT, CLOSURE and GQA-SGL, in some cases improving more than 30% over standard Transformers.
    Controlling Directions Orthogonal to a Classifier. (arXiv:2201.11259v1 [cs.LG])
    (2 min) We propose to identify directions invariant to a given classifier so that these directions can be controlled in tasks such as style transfer. While orthogonal decomposition is directly identifiable when the given classifier is linear, we formally define a notion of orthogonality in the non-linear case. We also provide a surprisingly simple method for constructing the orthogonal classifier (a classifier utilizing directions other than those of the given classifier). Empirically, we present three use cases where controlling orthogonal variation is important: style transfer, domain adaptation, and fairness. The orthogonal classifier enables desired style transfer when domains vary in multiple aspects, improves domain adaptation with label shifts and mitigates the unfairness as a predictor. The code is available at this http URL
    Towards Data-driven LQR with KoopmanizingFlows. (arXiv:2201.11640v1 [eess.SY])
    (2 min) We propose a novel framework for learning linear time-invariant (LTI) models for a class of continuous-time non-autonomous nonlinear dynamics based on a representation of Koopman operators. In general, the operator is infinite-dimensional but, crucially, linear. To utilize it for efficient LTI control, we learn a finite representation of the Koopman operator that is linear in controls while concurrently learning meaningful lifting coordinates. For the latter, we rely on KoopmanizingFlows - a diffeomorphism-based representation of Koopman operators. With such a learned model, we can replace the nonlinear infinite-horizon optimal control problem with quadratic costs to that of a linear quadratic regulator (LQR), facilitating efficacious optimal control for nonlinear systems. The prediction and control efficacy of the proposed method is verified on simulation examples.
    Density-Aware Hyper-Graph Neural Networks for Graph-based Semi-supervised Node Classification. (arXiv:2201.11511v1 [cs.LG])
    (2 min) Graph-based semi-supervised learning, which can exploit the connectivity relationship between labeled and unlabeled data, has been shown to outperform the state-of-the-art in many artificial intelligence applications. One of the most challenging problems for graph-based semi-supervised node classification is how to use the implicit information among various data to improve the performance of classifying. Traditional studies on graph-based semi-supervised learning have focused on the pairwise connections among data. However, the data correlation in real applications could be beyond pairwise and more complicated. The density information has been demonstrated to be an important clue, but it is rarely explored in depth among existing graph-based semi-supervised node classification methods. To develop a flexible and effective model for graph-based semi-supervised node classification, we propose a novel Density-Aware Hyper-Graph Neural Networks (DA-HGNN). In our proposed approach, hyper-graph is provided to explore the high-order semantic correlation among data, and a density-aware hyper-graph attention network is presented to explore the high-order connection relationship. Extensive experiments are conducted in various benchmark datasets, and the results demonstrate the effectiveness of the proposed approach.
    Multi-view learning with privileged weighted twin support vector machine. (arXiv:2201.11306v1 [stat.ML])
    (2 min) Weighted twin support vector machines (WLTSVM) mines as much potential similarity information in samples as possible to improve the common short-coming of non-parallel plane classifiers. Compared with twin support vector machines (TWSVM), it reduces the time complexity by deleting the superfluous constraints using the inter-class K-Nearest Neighbor (KNN). Multi-view learning (MVL) is a newly developing direction of machine learning, which focuses on learning acquiring information from the data indicated by multiple feature sets. In this paper, we propose multi-view learning with privileged weighted twin support vector machines (MPWTSVM). It not only inherits the advantages of WLTSVM but also has its characteristics. Firstly, it enhances generalization ability by mining intra-class information from the same perspective. Secondly, it reduces the redundancy constraints with the help of inter-class information, thus improving the running speed. Most importantly, it can follow both the consensus and the complementarity principle simultaneously as a multi-view classification model. The consensus principle is realized by minimizing the coupling items of the two views in the original objective function. The complementary principle is achieved by establishing privileged information paradigms and MVL. A standard quadratic programming solver is used to solve the problem. Compared with multi-view classification models such as SVM-2K, MVTSVM, MCPK, and PSVM-2V, our model has better accuracy and classification efficiency. Experimental results on 45 binary data sets prove the effectiveness of our method.
    Representation learnt by SGD and Adaptive learning rules -- Conditions that Vary Sparsity and Selectivity in Neural Network. (arXiv:2201.11653v1 [cs.LG])
    (2 min) From the point of view of the human brain, continual learning can perform various tasks without mutual interference. An effective way to reduce mutual interference can be found in sparsity and selectivity of neurons. According to Aljundi et al. and Hadsell et al., imposing sparsity at the representational level is advantageous for continual learning because sparse neuronal activations encourage less overlap between parameters, resulting in less interference. Similarly, highly selective neural networks are likely to induce less interference since particular response in neurons will reduce the chance of overlap with other parameters. Considering that the human brain performs continual learning over the lifespan, finding conditions where sparsity and selectivity naturally arises may provide insight for understanding how the brain functions. This paper investigates various conditions that naturally increase sparsity and selectivity in a neural network. This paper tested different optimizers with Hoyer's sparsity metric and CCMAS selectivity metric in MNIST classification task. It is essential to note that investigations on the natural occurrence of sparsity and selectivity concerning various conditions have not been acknowledged in any sector of neuroscience nor machine learning until this day. This paper found that particular conditions increase sparsity and selectivity such as applying a large learning rate and lowering a batch size. In addition to the relationship between the condition, sparsity, and selectivity, the following will be discussed based on empirical analysis: 1. The relationship between sparsity and selectivity and 2. The relationship between test accuracy, sparsity, and selectivity.
    Achieving Personalized Federated Learning with Sparse Local Models. (arXiv:2201.11380v1 [cs.LG])
    (2 min) Federated learning (FL) is vulnerable to heterogeneously distributed data, since a common global model in FL may not adapt to the heterogeneous data distribution of each user. To counter this issue, personalized FL (PFL) was proposed to produce dedicated local models for each individual user. However, PFL is far from its maturity, because existing PFL solutions either demonstrate unsatisfactory generalization towards different model architectures or cost enormous extra computation and memory. In this work, we propose federated learning with personalized sparse mask (FedSpa), a novel PFL scheme that employs personalized sparse masks to customize sparse local models on the edge. Instead of training an intact (or dense) PFL model, FedSpa only maintains a fixed number of active parameters throughout training (aka sparse-to-sparse training), which enables users' models to achieve personalization with cheap communication, computation, and memory cost. We theoretically show that the iterates obtained by FedSpa converge to the local minimizer of the formulated SPFL problem at rate of $\mathcal{O}(\frac{1}{\sqrt{T}})$. Comprehensive experiments demonstrate that FedSpa significantly saves communication and computation costs, while simultaneously achieves higher model accuracy and faster convergence speed against several state-of-the-art PFL methods.
    Deep Recurrent Learning for Heart Sounds Segmentation based on Instantaneous Frequency Features. (arXiv:2201.11320v1 [eess.AS])
    (2 min) In this work, a novel stack of well-known technologies is presented to determine an automatic method to segment the heart sounds in a phonocardiogram (PCG). We will show a deep recurrent neural network (DRNN) capable of segmenting a PCG into its main components and a very specific way of extracting instantaneous frequency that will play an important role in the training and testing of the proposed model. More specifically, it involves a Long Short-Term Memory (LSTM) neural network accompanied by the Fourier Synchrosqueezed Transform (FSST) used to extract instantaneous time-frequency features from a PCG. The present approach was tested on heart sound signals longer than 5 seconds and shorter than 35 seconds from freely-available databases. This approach proved that, with a relatively small architecture, a small set of data, and the right features, this method achieved an almost state-of-the-art performance, showing an average sensitivity of 89.5%, an average positive predictive value of 89.3\% and an average accuracy of 91.3%.
    Model Generalization in Arrival Runway Occupancy Time Prediction by Feature Equivalences. (arXiv:2201.11654v1 [cs.LG])
    (2 min) General real-time runway occupancy time prediction modelling for multiple airports is a current research gap. An attempt to generalize a real-time prediction model for Arrival Runway Occupancy Time (AROT) is presented in this paper by substituting categorical features by their numerical equivalences. Three days of data, collected from Saab Sensis' Aerobahn system at three US airports, has been used for this work. Three tree-based machine learning algorithms: Decision Tree, Random Forest and Gradient Boosting are used to assess the generalizability of the model using numerical equivalent features. We have shown that the model trained on numerical equivalent features not only have performances at least on par with models trained on categorical features but also can make predictions on unseen data from other airports.
    Distributed gradient-based optimization in the presence of dependent aperiodic communication. (arXiv:2201.11343v1 [math.OC])
    (2 min) Iterative distributed optimization algorithms involve multiple agents that communicate with each other, over time, in order to minimize/maximize a global objective. In the presence of unreliable communication networks, the Age-of-Information (AoI), which measures the freshness of data received, may be large and hence hinder algorithmic convergence. In this paper, we study the convergence of general distributed gradient-based optimization algorithms in the presence of communication that neither happens periodically nor at stochastically independent points in time. We show that convergence is guaranteed provided the random variables associated with the AoI processes are stochastically dominated by a random variable with finite first moment. This improves on previous requirements of boundedness of more than the first moment. We then introduce stochastically strongly connected (SSC) networks, a new stochastic form of strong connectedness for time-varying networks. We show: If for any $p \ge0$ the processes that describe the success of communication between agents in a SSC network are $\alpha$-mixing with $n^{p-1}\alpha(n)$ summable, then the associated AoI processes are stochastically dominated by a random variable with finite $p$-th moment. In combination with our first contribution, this implies that distributed stochastic gradient descend converges in the presence of AoI, if $\alpha(n)$ is summable.
    FairMod: Fair Link Prediction and Recommendation via Graph Modification. (arXiv:2201.11596v1 [cs.LG])
    (2 min) As machine learning becomes more widely adopted across domains, it is critical that researchers and ML engineers think about the inherent biases in the data that may be perpetuated by the model. Recently, many studies have shown that such biases are also imbibed in Graph Neural Network (GNN) models if the input graph is biased. In this work, we aim to mitigate the bias learned by GNNs through modifying the input graph. To that end, we propose FairMod, a Fair Graph Modification methodology with three formulations: the Global Fairness Optimization (GFO), Community Fairness Optimization (CFO), and Fair Edge Weighting (FEW) models. Our proposed models perform either microscopic or macroscopic edits to the input graph while training GNNs and learn node embeddings that are both accurate and fair under the context of link recommendations. We demonstrate the effectiveness of our approach on four real world datasets and show that we can improve the recommendation fairness by several factors at negligible cost to link prediction accuracy.
    Capacity and Achievable Rates of Fading Few-mode MIMO IM/DD Optical Fiber Channels. (arXiv:2201.11538v1 [cs.IT])
    (2 min) The optical fiber multiple-input multiple-output (MIMO) channel with intensity modulation and direct detection (IM/DD) per spatial path is treated. The spatial dimensions represent the multiple modes employed for transmission and the cross-talk between them originates in the multiplexers and demultiplexers, which are polarization dependent and thus timevarying. The upper bounds from free-space IM/DD MIMO channels are adapted to the fiber case, and the constellation constrained capacity is constructively estimated using the Blahut-Arimoto algorithm. An autoencoder is then proposed to optimize a practical MIMO transmission in terms of pre-coder and detector assuming channel distribution knowledge at the transmitter. The pre-coders are shown to be robust to changes in the channel.
    Neuro-Symbolic Entropy Regularization. (arXiv:2201.11250v1 [cs.LG])
    (2 min) In structured prediction, the goal is to jointly predict many output variables that together encode a structured object -- a path in a graph, an entity-relation triple, or an ordering of objects. Such a large output space makes learning hard and requires vast amounts of labeled data. Different approaches leverage alternate sources of supervision. One approach -- entropy regularization -- posits that decision boundaries should lie in low-probability regions. It extracts supervision from unlabeled examples, but remains agnostic to the structure of the output space. Conversely, neuro-symbolic approaches exploit the knowledge that not every prediction corresponds to a valid structure in the output space. Yet, they does not further restrict the learned output distribution. This paper introduces a framework that unifies both approaches. We propose a loss, neuro-symbolic entropy regularization, that encourages the model to confidently predict a valid object. It is obtained by restricting entropy regularization to the distribution over only valid structures. This loss is efficiently computed when the output constraint is expressed as a tractable logic circuit. Moreover, it seamlessly integrates with other neuro-symbolic losses that eliminate invalid predictions. We demonstrate the efficacy of our approach on a series of semi-supervised and fully-supervised structured-prediction experiments, where we find that it leads to models whose predictions are more accurate and more likely to be valid.
    Synchromesh: Reliable code generation from pre-trained language models. (arXiv:2201.11227v1 [cs.LG])
    (2 min) Large pre-trained language models have been used to generate code,providing a flexible interface for synthesizing programs from natural language specifications. However, they often violate syntactic and semantic rules of their output language, limiting their practical usability. In this paper, we propose Synchromesh: a framework for substantially improving the reliability of pre-trained models for code generation. Synchromesh comprises two components. First, it retrieves few-shot examples from a training bank using Target Similarity Tuning (TST), a novel method for semantic example selection. TST learns to recognize utterances that describe similar target programs despite differences in surface natural language features. Then, Synchromesh feeds the examples to a pre-trained language model and samples programs using Constrained Semantic Decoding (CSD): a general framework for constraining the output to a set of valid programs in the target language. CSD leverages constraints on partial outputs to sample complete correct programs, and needs neither re-training nor fine-tuning of the language model. We evaluate our methods by synthesizing code from natural language descriptions using GPT-3 and Codex in three real-world languages: SQL queries, Vega-Lite visualizations and SMCalFlow programs. These domains showcase rich constraints that CSD is able to enforce, including syntax, scope, typing rules, and contextual logic. We observe substantial complementary gains from CSD and TST in prediction accuracy and in effectively preventing run-time errors.
    Multimodal neural networks better explain multivoxel patterns in the hippocampus. (arXiv:2201.11517v1 [q-bio.NC])
    (2 min) The human hippocampus possesses "concept cells", neurons that fire when presented with stimuli belonging to a specific concept, regardless of the modality. Recently, similar concept cells were discovered in a multimodal network called CLIP (Radford et at., 2021). Here, we ask whether CLIP can explain the fMRI activity of the human hippocampus better than a purely visual (or linguistic) model. We extend our analysis to a range of publicly available uni- and multi-modal models. We demonstrate that "multimodality" stands out as a key component when assessing the ability of a network to explain the multivoxel activity in the hippocampus.
    Predicting Succinylation Sites in Proteins with Improved Deep Learning Architecture. (arXiv:2201.11215v1 [q-bio.BM])
    (2 min) Post-translational modifications (PTMs) in proteins occur after the process of translation. PTMs account for many cellular processes such as deoxyribonucleic acid (DNA) repair, cell signaling and cell death. One of the recent PTMs is succinylation. Succinylation modifies lysine residue from $-1$ to $+1$. Locating succinylation sites using experimental methods, such as mass spectrometry is very laborious. Hence, computational methods are favored using machine learning techniques. This paper proposes a deep learning architecture to predict succinylation sites. The performance of the proposed architecture is compared to the state-of-the-art deep learning architecture and other traditional machine learning techniques for succinylation. It is shown from the performance metrics that the proposed architecture provides a good trade-off between speed of computation and classification accuracy.
    Fairness implications of encoding protected categorical attributes. (arXiv:2201.11358v1 [cs.LG])
    (2 min) Protected attributes are often presented as categorical features that need to be encoded before feeding them into a machine learning algorithm. Encoding these attributes is paramount as they determine the way the algorithm will learn from the data. Categorical feature encoding has a direct impact on the model performance and fairness. In this work, we compare the accuracy and fairness implications of the two most well-known encoders: one-hot encoding and target encoding. We distinguish between two types of induced bias that can arise while using these encodings and can lead to unfair models. The first type, irreducible bias, is due to direct group category discrimination and a second type, reducible bias, is due to large variance in less statistically represented groups. We take a deeper look into how regularization methods for target encoding can improve the induced bias while encoding categorical features. Furthermore, we tackle the problem of intersectional fairness that arises when mixing two protected categorical features leading to higher cardinality. This practice is a powerful feature engineering technique used for boosting model performance. We study its implications on fairness as it can increase both types of induced bias
    Jointly Learning Knowledge Embedding and Neighborhood Consensus with Relational Knowledge Distillation for Entity Alignment. (arXiv:2201.11249v1 [cs.LG])
    (2 min) Entity alignment aims at integrating heterogeneous knowledge from different knowledge graphs. Recent studies employ embedding-based methods by first learning the representation of Knowledge Graphs and then performing entity alignment via measuring the similarity between entity embeddings. However, they failed to make good use of the relation semantic information due to the trade-off problem caused by the different objectives of learning knowledge embedding and neighborhood consensus. To address this problem, we propose Relational Knowledge Distillation for Entity Alignment (RKDEA), a Graph Convolutional Network (GCN) based model equipped with knowledge distillation for entity alignment. We adopt GCN-based models to learn the representation of entities by considering the graph structure and incorporating the relation semantic information into GCN via knowledge distillation. Then, we introduce a novel adaptive mechanism to transfer relational knowledge so as to jointly learn entity embedding and neighborhood consensus. Experimental results on several benchmarking datasets demonstrate the effectiveness of our proposed model.
    Mapping the Buried Cable by Ground Penetrating Radar and Gaussian-Process Regression. (arXiv:2201.11253v1 [cs.LG])
    (2 min) With the rapid expansion of urban areas and the increasingly use of electricity, the need for locating buried cables is becoming urgent. In this paper, a noval method to locate underground cables based on Ground Penetrating Radar (GPR) and Gaussian-process regression is proposed. Firstly, the coordinate system of the detected area is conducted, and the input and output of locating buried cables are determined. The GPR is moved along the established parallel detection lines, and the hyperbolic signatures generated by buried cables are identified and fitted, thus the positions and depths of some points on the cable could be derived. On the basis of the established coordinate system and the derived points on the cable, the clustering method and cable fitting algorithm based on Gaussian-process regression are proposed to find the most likely locations of the underground cables. Furthermore, the confidence intervals of the cable's locations are also obtained. Both the position and depth noises are taken into account in our method, ensuring the robustness and feasibility in different environments and equipments. Experiments on real-world datasets are conducted, and the obtained results demonstrate the effectiveness of the proposed method.
    Inference-optimized AI and high performance computing for gravitational wave detection at scale. (arXiv:2201.11133v1 [gr-qc])
    (2 min) We introduce an ensemble of artificial intelligence models for gravitational wave detection that we trained in the Summit supercomputer using 32 nodes, equivalent to 192 NVIDIA V100 GPUs, within 2 hours. Once fully trained, we optimized these models for accelerated inference using NVIDIA TensorRT. We deployed our inference-optimized AI ensemble in the ThetaGPU supercomputer at Argonne Leadership Computer Facility to conduct distributed inference. Using the entire ThetaGPU supercomputer, consisting of 20 nodes each of which has 8 NVIDIA A100 Tensor Core GPUs and 2 AMD Rome CPUs, our NVIDIA TensorRT-optimized AI ensemble porcessed an entire month of advanced LIGO data (including Hanford and Livingston data streams) within 50 seconds. Our inference-optimized AI ensemble retains the same sensitivity of traditional AI models, namely, it identifies all known binary black hole mergers previously identified in this advanced LIGO dataset and reports no misclassifications, while also providing a 3X inference speedup compared to traditional artificial intelligence models. We used time slides to quantify the performance of our AI ensemble to process up to 5 years worth of advanced LIGO data. In this synthetically enhanced dataset, our AI ensemble reports an average of one misclassification for every month of searched advanced LIGO data. We also present the receiver operating characteristic curve of our AI ensemble using this 5 year long advanced LIGO dataset. This approach provides the required tools to conduct accelerated, AI-driven gravitational wave detection at scale.
    A dual approach for federated learning. (arXiv:2201.11183v1 [cs.LG])
    (2 min) We study the federated optimization problem from a dual perspective and propose a new algorithm termed federated dual coordinate descent (FedDCD), which is based on a type of coordinate descent method developed by Necora et al. [Journal of Optimization Theory and Applications, 2017]. Additionally, we enhance the FedDCD method with inexact gradient oracles and Nesterov's acceleration. We demonstrate theoretically that our proposed approach achieves better convergence rates than the state-of-the-art primal federated optimization algorithms under mild conditions. Numerical experiments on real-world datasets support our analysis.
    Crystal structure prediction with machine learning-based element substitution. (arXiv:2201.11188v1 [cond-mat.mtrl-sci])
    (2 min) The prediction of energetically stable crystal structures formed by a given chemical composition is a central problem in solid-state physics. In principle, the crystalline state of assembled atoms can be determined by optimizing the energy surface, which in turn can be evaluated using first-principles calculations. However, performing the iterative gradient descent on the potential energy surface using first-principles calculations is prohibitively expensive for complex systems, such as those with many atoms per unit cell. Here, we present a unique methodology for crystal structure prediction (CSP) that relies on a machine learning algorithm called metric learning. It is shown that a binary classifier, trained on a large number of already identified crystal structures, can determine the isomorphism of crystal structures formed by two given chemical compositions with an accuracy of approximately 96.4\%. For a given query composition with an unknown crystal structure, the model is used to automatically select from a crystal structure database a set of template crystals with nearly identical stable structures to which element substitution is to be applied. Apart from the local relaxation calculation of the identified templates, the proposed method does not use ab initio calculations. The potential of this substation-based CSP is demonstrated for a wide variety of crystal systems.
    Challenges and Opportunities for Machine Learning Classification of Behavior and Mental State from Images. (arXiv:2201.11197v1 [cs.CV])
    (2 min) Computer Vision (CV) classifiers which distinguish and detect nonverbal social human behavior and mental state can aid digital diagnostics and therapeutics for psychiatry and the behavioral sciences. While CV classifiers for traditional and structured classification tasks can be developed with standard machine learning pipelines for supervised learning consisting of data labeling, preprocessing, and training a convolutional neural network, there are several pain points which arise when attempting this process for behavioral phenotyping. Here, we discuss the challenges and corresponding opportunities in this space, including handling heterogeneous data, avoiding biased models, labeling massive and repetitive data sets, working with ambiguous or compound class labels, managing privacy concerns, creating appropriate representations, and personalizing models. We discuss current state-of-the-art research endeavors in CV such as data curation, data augmentation, crowdsourced labeling, active learning, reinforcement learning, generative models, representation learning, federated learning, and meta-learning. We highlight at least some of the machine learning advancements needed for imaging classifiers to detect human social cues successfully and reliably.
    DNNFuser: Generative Pre-Trained Transformer as a Generalized Mapper for Layer Fusion in DNN Accelerators. (arXiv:2201.11218v1 [cs.LG])
    (2 min) Dataflow/mapping decides the compute and energy efficiency of DNN accelerators. Many mappers have been proposed to tackle the intra-layer map-space. However, mappers for inter-layer map-space (aka layer-fusion map-space), have been rarely discussed. In this work, we propose a mapper, DNNFuser, specifically focusing on this layer-fusion map-space. While existing SOTA DNN mapping explorations rely on search-based mappers, this is the first work, to the best of our knowledge, to propose a one-shot inference-based mapper. We leverage a famous language model GPT as our DNN architecture to learn layer-fusion optimization as a sequence modeling problem. Further, the trained DNNFuser can generalize its knowledge and infer new solutions for unseen conditions. Within one inference pass, DNNFuser can infer solutions with compatible performance to the ones found by a highly optimized search-based mapper while being 66x-127x faster.
    Hyperparameter Tuning for Deep Reinforcement Learning Applications. (arXiv:2201.11182v1 [cs.LG])
    (2 min) Reinforcement learning (RL) applications, where an agent can simply learn optimal behaviors by interacting with the environment, are quickly gaining tremendous success in a wide variety of applications from controlling simple pendulums to complex data centers. However, setting the right hyperparameters can have a huge impact on the deployed solution performance and reliability in the inference models, produced via RL, used for decision-making. Hyperparameter search itself is a laborious process that requires many iterations and computationally expensive to find the best settings that produce the best neural network architectures. In comparison to other neural network architectures, deep RL has not witnessed much hyperparameter tuning, due to its algorithm complexity and simulation platforms needed. In this paper, we propose a distributed variable-length genetic algorithm framework to systematically tune hyperparameters for various RL applications, improving training time and robustness of the architecture, via evolution. We demonstrate the scalability of our approach on many RL problems (from simple gyms to complex applications) and compared with Bayesian approach. Our results show that with more generations, optimal solutions that require fewer training episodes and are computationally cheap while being more robust for deployment. Our results are imperative to advance deep reinforcement learning controllers for real-world problems.
    Heterogeneous Peer Effects in the Linear Threshold Model. (arXiv:2201.11242v1 [cs.SI])
    (2 min) The Linear Threshold Model is a widely used model that describes how information diffuses through a social network. According to this model, an individual adopts an idea or product after the proportion of their neighbors who have adopted it reaches a certain threshold. Typical applications of the Linear Threshold Model assume that thresholds are either the same for all network nodes or randomly distributed, even though some people may be more susceptible to peer pressure than others. To address individual-level differences, we propose causal inference methods for estimating individual thresholds that can more accurately predict whether and when individuals will be affected by their peers. We introduce the concept of heterogeneous peer effects and develop a Structural Causal Model which corresponds to the Linear Threshold Model and supports heterogeneous peer effect identification and estimation. We develop two algorithms for individual threshold estimation, one based on causal trees and one based on causal meta-learners. Our experimental results on synthetic and real-world datasets show that our proposed models can better predict individual-level thresholds in the Linear Threshold Model and thus more precisely predict which nodes will get activated over time.
    Online Change Point Detection for Weighted and Directed Random Dot Product Graphs. (arXiv:2201.11222v1 [cs.LG])
    (2 min) Given a sequence of random (directed and weighted) graphs, we address the problem of online monitoring and detection of changes in the underlying data distribution. Our idea is to endow sequential change-point detection (CPD) techniques with a graph representation learning substrate based on the versatile Random Dot Product Graph (RDPG) model. We consider efficient, online updates of a judicious monitoring function, which quantifies the discrepancy between the streaming graph observations and the nominal RDPG. This reference distribution is inferred via spectral embeddings of the first few graphs in the sequence. We characterize the distribution of this running statistic to select thresholds that guarantee error-rate control, and under simplifying approximations we offer insights on the algorithm's detection resolution and delay. The end result is a lightweight online CPD algorithm, that is also explainable by virtue of the well-appreciated interpretability of RDPG embeddings. This is in stark contrast with most existing graph CPD approaches, which either rely on extensive computation, or they store and process the entire observed time series. An apparent limitation of the RDPG model is its suitability for undirected and unweighted graphs only, a gap we aim to close here to broaden the scope of the CPD framework. Unlike previous proposals, our non-parametric RDPG model for weighted graphs does not require a priori specification of the weights' distribution to perform inference and estimation. This network modeling contribution is of independent interest beyond CPD. We offer an open-source implementation of the novel online CPD algorithm for weighted and direct graphs, whose effectiveness and efficiency are demonstrated via (reproducible) synthetic and real network data experiments.
    On The Energy Statistics of Feature Maps in Pruning of Neural Networks with Skip-Connections. (arXiv:2201.11209v1 [cs.LG])
    (2 min) We propose a new structured pruning framework for compressing Deep Neural Networks (DNNs) with skip connections, based on measuring the statistical dependency of hidden layers and predicted outputs. The dependence measure defined by the energy statistics of hidden layers serves as a model-free measure of information between the feature maps and the output of the network. The estimated dependence measure is subsequently used to prune a collection of redundant and uninformative layers. Model-freeness of our measure guarantees that no parametric assumptions on the feature map distribution are required, making it computationally appealing for very high dimensional feature space in DNNs. Extensive numerical experiments on various architectures show the efficacy of the proposed pruning approach with competitive performance to state-of-the-art methods.
    Reinforcement Learning Based Query Vertex Ordering Model for Subgraph Matching. (arXiv:2201.11251v1 [cs.LG])
    (2 min) Subgraph matching is a fundamental problem in various fields that use graph structured data. Subgraph matching algorithms enumerate all isomorphic embeddings of a query graph q in a data graph G. An important branch of matching algorithms exploit the backtracking search approach which recursively extends intermediate results following a matching order of query vertices. It has been shown that the matching order plays a critical role in time efficiency of these backtracking based subgraph matching algorithms. In recent years, many advanced techniques for query vertex ordering (i.e., matching order generation) have been proposed to reduce the unpromising intermediate results according to the preset heuristic rules. In this paper, for the first time we apply the Reinforcement Learning (RL) and Graph Neural Networks (GNNs) techniques to generate the high-quality matching order for subgraph matching algorithms. Instead of using the fixed heuristics to generate the matching order, our model could capture and make full use of the graph information, and thus determine the query vertex order with the adaptive learning-based rule that could significantly reduces the number of redundant enumerations. With the help of the reinforcement learning framework, our model is able to consider the long-term benefits rather than only consider the local information at current ordering step.Extensive experiments on six real-life data graphs demonstrate that our proposed matching order generation technique could reduce up to two orders of magnitude of query processing time compared to the state-of-the-art algorithms.
    Born-Infeld (BI) for AI: Energy-Conserving Descent (ECD) for Optimization. (arXiv:2201.11137v1 [cs.LG])
    (2 min) We introduce a novel framework for optimization based on energy-conserving Hamiltonian dynamics in a strongly mixing (chaotic) regime and establish its key properties analytically and numerically. The prototype is a discretization of Born-Infeld dynamics, with a squared relativistic speed limit depending on the objective function. This class of frictionless, energy-conserving optimizers proceeds unobstructed until slowing naturally near the minimal loss, which dominates the phase space volume of the system. Building from studies of chaotic systems such as dynamical billiards, we formulate a specific algorithm with good performance on machine learning and PDE-solving tasks, including generalization. It cannot stop at a high local minimum and cannot overshoot the global minimum, yielding an advantage in non-convex loss functions, and proceeds faster than GD+momentum in shallow valleys.
    Objective Prediction of Tomorrow's Affect Using Multi-Modal Physiological Data and Personal Chronicles: A Study of Monitoring College Student Well-being in 2020. (arXiv:2201.11230v1 [cs.HC])
    (2 min) Monitoring and understanding affective states are important aspects of healthy functioning and treatment of mood-based disorders. Recent advancements of ubiquitous wearable technologies have increased the reliability of such tools in detecting and accurately estimating mental states (e.g., mood, stress, etc.), offering comprehensive and continuous monitoring of individuals over time. Previous attempts to model an individual's mental state were limited to subjective approaches or the inclusion of only a few modalities (i.e., phone, watch). Thus, the goal of our study was to investigate the capacity to more accurately predict affect through a fully automatic and objective approach using multiple commercial devices. Longitudinal physiological data and daily assessments of emotions were collected from a sample of college students using smart wearables and phones for over a year. Results showed that our model was able to predict next-day affect with accuracy comparable to state of the art methods.
    Quantile-Based Policy Optimization for Reinforcement Learning. (arXiv:2201.11463v1 [cs.LG])
    (2 min) Classical reinforcement learning (RL) aims to optimize the expected cumulative rewards. In this work, we consider the RL setting where the goal is to optimize the quantile of the cumulative rewards. We parameterize the policy controlling actions by neural networks and propose a novel policy gradient algorithm called Quantile-Based Policy Optimization (QPO) and its variant Quantile-Based Proximal Policy Optimization (QPPO) to solve deep RL problems with quantile objectives. QPO uses two coupled iterations running at different time scales for simultaneously estimating quantiles and policy parameters and is shown to converge to the global optimal policy under certain conditions. Our numerical results demonstrate that the proposed algorithms outperform the existing baseline algorithms under the quantile criterion.
    Semantic Code Classification for Automated Machine Learning. (arXiv:2201.11252v1 [cs.LG])
    (2 min) A range of applications for automatic machine learning need the generation process to be controllable. In this work, we propose a way to control the output via a sequence of simple actions, that are called semantic code classes. Finally, we present a semantic code classification task and discuss methods for solving this problem on the Natural Language to Machine Learning (NL2ML) dataset.
    Gap Minimization for Knowledge Sharing and Transfer. (arXiv:2201.11231v1 [cs.LG])
    (2 min) Learning from multiple related tasks by knowledge sharing and transfer has become increasingly relevant over the last two decades. In order to successfully transfer information from one task to another, it is critical to understand the similarities and differences between the domains. In this paper, we introduce the notion of \emph{performance gap}, an intuitive and novel measure of the distance between learning tasks. Unlike existing measures which are used as tools to bound the difference of expected risks between tasks (e.g., $\mathcal{H}$-divergence or discrepancy distance), we theoretically show that the performance gap can be viewed as a data- and algorithm-dependent regularizer, which controls the model complexity and leads to finer guarantees. More importantly, it also provides new insights and motivates a novel principle for designing strategies for knowledge sharing and transfer: gap minimization. We instantiate this principle with two algorithms: 1. {gapBoost}, a novel and principled boosting algorithm that explicitly minimizes the performance gap between source and target domains for transfer learning; and 2. {gapMTNN}, a representation learning algorithm that reformulates gap minimization as semantic conditional matching for multitask learning. Our extensive evaluation on both transfer learning and multitask learning benchmark data sets shows that our methods outperform existing baselines.
    Attention cannot be an Explanation. (arXiv:2201.11194v1 [cs.HC])
    (2 min) Attention based explanations (viz. saliency maps), by providing interpretability to black box models such as deep neural networks, are assumed to improve human trust and reliance in the underlying models. Recently, it has been shown that attention weights are frequently uncorrelated with gradient-based measures of feature importance. Motivated by this, we ask a follow-up question: "Assuming that we only consider the tasks where attention weights correlate well with feature importance, how effective are these attention based explanations in increasing human trust and reliance in the underlying models?". In other words, can we use attention as an explanation? We perform extensive human study experiments that aim to qualitatively and quantitatively assess the degree to which attention based explanations are suitable in increasing human trust and reliance. Our experiment results show that attention cannot be used as an explanation.
    Generative Trees: Adversarial and Copycat. (arXiv:2201.11205v1 [cs.LG])
    (2 min) While Generative Adversarial Networks (GANs) achieve spectacular results on unstructured data like images, there is still a gap on tabular data, data for which state of the art supervised learning still favours to a large extent decision tree (DT)-based models. This paper proposes a new path forward for the generation of tabular data, exploiting decades-old understanding of the supervised task's best components for DT induction, from losses (properness), models (tree-based) to algorithms (boosting). The \textit{properness} condition on the supervised loss -- which postulates the optimality of Bayes rule -- leads us to a variational GAN-style loss formulation which is \textit{tight} when discriminators meet a calibration property trivially satisfied by DTs, and, under common assumptions about the supervised loss, yields "one loss to train against them all" for the generator: the $\chi^2$. We then introduce tree-based generative models, \textit{generative trees} (GTs), meant to mirror on the generative side the good properties of DTs for classifying tabular data, with a boosting-compliant \textit{adversarial} training algorithm for GTs. We also introduce \textit{copycat training}, in which the generator copies at run time the underlying tree (graph) of the discriminator DT and completes it for the hardest discriminative task, with boosting compliant convergence. We test our algorithms on tasks including fake/real distinction, training from fake data and missing data imputation. Each one of these tasks displays that GTs can provide comparatively simple -- and interpretable -- contenders to sophisticated state of the art methods for data generation (using neural network models) or missing data imputation (relying on multiple imputation by chained equations with complex tree-based modeling).
    On the Convergence of mSGD and AdaGrad for Stochastic Optimization. (arXiv:2201.11204v1 [cs.LG])
    (2 min) As one of the most fundamental stochastic optimization algorithms, stochastic gradient descent (SGD) has been intensively developed and extensively applied in machine learning in the past decade. There have been some modified SGD-type algorithms, which outperform the SGD in many competitions and applications in terms of convergence rate and accuracy, such as momentum-based SGD (mSGD) and adaptive gradient algorithm (AdaGrad). Despite these empirical successes, the theoretical properties of these algorithms have not been well established due to technical difficulties. With this motivation, we focus on convergence analysis of mSGD and AdaGrad for any smooth (possibly non-convex) loss functions in stochastic optimization. First, we prove that the iterates of mSGD are asymptotically convergent to a connected set of stationary points with probability one, which is more general than existing works on subsequence convergence or convergence of time averages. Moreover, we prove that the loss function of mSGD decays at a certain rate faster than that of SGD. In addition, we prove the iterates of AdaGrad are asymptotically convergent to a connected set of stationary points with probability one. Also, this result extends the results from the literature on subsequence convergence and the convergence of time averages. Despite the generality of the above convergence results, we have relaxed some assumptions of gradient noises, convexity of loss functions, as well as boundedness of iterates.
    Reward-Free RL is No Harder Than Reward-Aware RL in Linear Markov Decision Processes. (arXiv:2201.11206v1 [cs.LG])
    (2 min) Reward-free reinforcement learning (RL) considers the setting where the agent does not have access to a reward function during exploration, but must propose a near-optimal policy for an arbitrary reward function revealed only after exploring. In the the tabular setting, it is well known that this is a more difficult problem than PAC RL -- where the agent has access to the reward function during exploration -- with optimal sample complexities in the two settings differing by a factor of $|\mathcal{S}|$, the size of the state space. We show that this separation does not exist in the setting of linear MDPs. We first develop a computationally efficient algorithm for reward-free RL in a $d$-dimensional linear MDP with sample complexity scaling as $\mathcal{O}(d^2/\epsilon^2)$. We then show a matching lower bound of $\Omega(d^2/\epsilon^2)$ on PAC RL. To our knowledge, our approach is the first computationally efficient algorithm to achieve optimal $d$ dependence in linear MDPs, even in the single-reward PAC setting. Our algorithm relies on a novel procedure which efficiently traverses a linear MDP, collecting samples in any given "feature direction", and enjoys a sample complexity scaling optimally in the (linear MDP equivalent of the) maximal state visitation probability. We show that this exploration procedure can also be applied to solve the problem of obtaining "well-conditioned" covariates in linear MDPs.
    IMACS: Image Model Attribution Comparison Summaries. (arXiv:2201.11196v1 [cs.LG])
    (2 min) Developing a suitable Deep Neural Network (DNN) often requires significant iteration, where different model versions are evaluated and compared. While metrics such as accuracy are a powerful means to succinctly describe a model's performance across a dataset or to directly compare model versions, practitioners often wish to gain a deeper understanding of the factors that influence a model's predictions. Interpretability techniques such as gradient-based methods and local approximations can be used to examine small sets of inputs in fine detail, but it can be hard to determine if results from small sets generalize across a dataset. We introduce IMACS, a method that combines gradient-based model attributions with aggregation and visualization techniques to summarize differences in attributions between two DNN image models. More specifically, IMACS extracts salient input features from an evaluation dataset, clusters them based on similarity, then visualizes differences in model attributions for similar input features. In this work, we introduce a framework for aggregating, summarizing, and comparing the attribution information for two models across a dataset; present visualizations that highlight differences between 2 image classification models; and show how our technique can uncover behavioral differences caused by domain shift between two models trained on satellite images.
    OntoProtein: Protein Pretraining With Gene Ontology Embedding. (arXiv:2201.11147v1 [q-bio.BM])
    (2 min) Self-supervised protein language models have proved their effectiveness in learning the proteins representations. With the increasing computational power, current protein language models pre-trained with millions of diverse sequences can advance the parameter scale from million-level to billion-level and achieve remarkable improvement. However, those prevailing approaches rarely consider incorporating knowledge graphs (KGs), which can provide rich structured knowledge facts for better protein representations. We argue that informative biology knowledge in KGs can enhance protein representation with external knowledge. In this work, we propose OntoProtein, the first general framework that makes use of structure in GO (Gene Ontology) into protein pre-training models. We construct a novel large-scale knowledge graph that consists of GO and its related proteins, and gene annotation texts or protein sequences describe all nodes in the graph. We propose novel contrastive learning with knowledge-aware negative sampling to jointly optimize the knowledge graph and protein embedding during pre-training. Experimental results show that OntoProtein can surpass state-of-the-art methods with pre-trained protein language models in TAPE benchmark and yield better performance compared with baselines in protein-protein interaction and protein function prediction. Code and datasets are available in https://github.com/zjunlp/OntoProtein.
    Self-Certifying Classification by Linearized Deep Assignment. (arXiv:2201.11162v1 [stat.ML])
    (2 min) We propose a novel class of deep stochastic predictors for classifying metric data on graphs within the PAC-Bayes risk certification paradigm. Classifiers are realized as linearly parametrized deep assignment flows with random initial conditions. Building on the recent PAC-Bayes literature and data-dependent priors, this approach enables (i) to use risk bounds as training objectives for learning posterior distributions on the hypothesis space and (ii) to compute tight out-of-sample risk certificates of randomized classifiers more efficiently than related work. Comparison with empirical test set errors illustrates the performance and practicality of this self-certifying classification method.
    Towards Agnostic Feature-based Dynamic Pricing: Linear Policies vs Linear Valuation with Unknown Noise. (arXiv:2201.11341v1 [cs.LG])
    (2 min) In feature-based dynamic pricing, a seller sets appropriate prices for a sequence of products (described by feature vectors) on the fly by learning from the binary outcomes of previous sales sessions ("Sold" if valuation $\geq$ price, and "Not Sold" otherwise). Existing works either assume noiseless linear valuation or precisely-known noise distribution, which limits the applicability of those algorithms in practice when these assumptions are hard to verify. In this work, we study two more agnostic models: (a) a "linear policy" problem where we aim at competing with the best linear pricing policy while making no assumptions on the data, and (b) a "linear noisy valuation" problem where the random valuation is linear plus an unknown and assumption-free noise. For the former model, we show a $\tilde{\Theta}(d^{\frac13}T^{\frac23})$ minimax regret up to logarithmic factors. For the latter model, we present an algorithm that achieves an $\tilde{O}(T^{\frac34})$ regret, and improve the best-known lower bound from $\Omega(T^{\frac35})$ to $\tilde{\Omega}(T^{\frac23})$. These results demonstrate that no-regret learning is possible for feature-based dynamic pricing under weak assumptions, but also reveal a disappointing fact that the seemingly richer pricing feedback is not significantly more useful than the bandit-feedback in regret reduction.
  • stat.ML updates on arXiv.org

    Fast Moving Natural Evolution Strategy for High-Dimensional Problems. (arXiv:2201.11422v1 [cs.NE])
    (2 min) In this work, we propose a new variant of natural evolution strategies (NES) for high-dimensional black-box optimization problems. The proposed method, CR-FM-NES, extends a recently proposed state-of-the-art NES, Fast Moving Natural Evolution Strategy (FM-NES), in order to be applicable in high-dimensional problems. CR-FM-NES builds on an idea using a restricted representation of a covariance matrix instead of using a full covariance matrix, while inheriting an efficiency of FM-NES. The restricted representation of the covariance matrix enables CR-FM-NES to update parameters of a multivariate normal distribution in linear time and space complexity, which can be applied to high-dimensional problems. Our experimental results reveal that CR-FM-NES does not lose the efficiency of FM-NES, and on the contrary, CR-FM-NES has achieved significant speedup compared to FM-NES on some benchmark problems. Furthermore, our numerical experiments using 200, 600, and 1000-dimensional benchmark problems demonstrate that CR-FM-NES is effective over scalable baseline methods, VD-CMA and Sep-CMA.
    Transfer Portal: Accurately Forecasting the Impact of a Player Transfer in Soccer. (arXiv:2201.11533v1 [cs.LG])
    (2 min) One of the most important and challenging problems in football is predicting future player performance when transferred to another club within and between different leagues. In addition to being the most valuable prediction a team makes, it is also the most complex analytics task to perform as it needs to take into consideration: a) differences in playing style between the player's current team and target team, b) differences in style and ability of other players on each team, c) differences in league quality and style, and d) the role the player is desired to play. In this paper, we present a method which addresses these issues and enables us to make accurate predictions of future performance. Our Transfer Portal model utilizes a personalized neural network accounting for both stylistic and ability level input representations for players, teams, and leagues to simulate future player performance at any chosen club. Furthermore, we use a Bayesian updating framework to dynamically modify player and team representations over time which enables us to generate predictions for rising stars with small amounts of data.
    As easy as APC: overcoming missing data and class imbalance in time series with self-supervised learning. (arXiv:2106.15577v5 [cs.LG] UPDATED)
    (2 min) High levels of missing data and strong class imbalance are ubiquitous challenges that are often presented simultaneously in real-world time series data. Existing methods approach these problems separately, frequently making significant assumptions about the underlying data generation process in order to lessen the impact of missing information. In this work, we instead demonstrate how a general self-supervised training method, namely Autoregressive Predictive Coding (APC), can be leveraged to overcome both missing data and class imbalance simultaneously without strong assumptions. Specifically, on a synthetic dataset, we show that standard baselines are substantially improved upon through the use of APC, yielding the greatest gains in the combined setting of high missingness and severe class imbalance. We further apply APC on two real-world medical time-series datasets, and show that APC improves the classification performance in all settings, ultimately achieving state-of-the-art AUPRC results on the Physionet benchmark.
    A Neighbourhood Framework for Resource-Lean Content Flagging. (arXiv:2103.17055v3 [cs.CL] UPDATED)
    (2 min) We propose a novel framework for cross-lingual content flagging with limited target-language data, which significantly outperforms prior work in terms of predictive performance. The framework is based on a nearest-neighbour architecture. It is a modern instantiation of the vanilla k-nearest neighbour model, as we use Transformer representations in all its components. Our framework can adapt to new source-language instances, without the need to be retrained from scratch. Unlike prior work on neighbourhood-based approaches, we encode the neighbourhood information based on query--neighbour interactions. We propose two encoding schemes and we show their effectiveness using both qualitative and quantitative analysis. Our evaluation results on eight languages from two different datasets for abusive language detection show sizable improvements of up to 9.5 F1 points absolute (for Italian) over strong baselines. On average, we achieve 3.6 absolute F1 points of improvement for the three languages in the Jigsaw Multilingual dataset and 2.14 points for the WUL dataset.
    Monitoring Model Deterioration with Explainable Uncertainty Estimation via Non-parametric Bootstrap. (arXiv:2201.11676v1 [cs.LG])
    (2 min) Monitoring machine learning models once they are deployed is challenging. It is even more challenging to decide when to retrain models in real-case scenarios when labeled data is beyond reach, and monitoring performance metrics becomes unfeasible. In this work, we use non-parametric bootstrapped uncertainty estimates and SHAP values to provide explainable uncertainty estimation as a technique that aims to monitor the deterioration of machine learning models in deployment environments, as well as determine the source of model deterioration when target labels are not available. Classical methods are purely aimed at detecting distribution shift, which can lead to false positives in the sense that the model has not deteriorated despite a shift in the data distribution. To estimate model uncertainty we construct prediction intervals using a novel bootstrap method, which improves upon the work of Kumar & Srivastava (2012). We show that both our model deterioration detection system as well as our uncertainty estimation method achieve better performance than the current state-of-the-art. Finally, we use explainable AI techniques to gain an understanding of the drivers of model deterioration. We release an open source Python package, doubt, which implements our proposed methods, as well as the code used to reproduce our experiments.
    Neuro-Symbolic Entropy Regularization. (arXiv:2201.11250v1 [cs.LG])
    (2 min) In structured prediction, the goal is to jointly predict many output variables that together encode a structured object -- a path in a graph, an entity-relation triple, or an ordering of objects. Such a large output space makes learning hard and requires vast amounts of labeled data. Different approaches leverage alternate sources of supervision. One approach -- entropy regularization -- posits that decision boundaries should lie in low-probability regions. It extracts supervision from unlabeled examples, but remains agnostic to the structure of the output space. Conversely, neuro-symbolic approaches exploit the knowledge that not every prediction corresponds to a valid structure in the output space. Yet, they does not further restrict the learned output distribution. This paper introduces a framework that unifies both approaches. We propose a loss, neuro-symbolic entropy regularization, that encourages the model to confidently predict a valid object. It is obtained by restricting entropy regularization to the distribution over only valid structures. This loss is efficiently computed when the output constraint is expressed as a tractable logic circuit. Moreover, it seamlessly integrates with other neuro-symbolic losses that eliminate invalid predictions. We demonstrate the efficacy of our approach on a series of semi-supervised and fully-supervised structured-prediction experiments, where we find that it leads to models whose predictions are more accurate and more likely to be valid.
    Towards Agnostic Feature-based Dynamic Pricing: Linear Policies vs Linear Valuation with Unknown Noise. (arXiv:2201.11341v1 [cs.LG])
    (2 min) In feature-based dynamic pricing, a seller sets appropriate prices for a sequence of products (described by feature vectors) on the fly by learning from the binary outcomes of previous sales sessions ("Sold" if valuation $\geq$ price, and "Not Sold" otherwise). Existing works either assume noiseless linear valuation or precisely-known noise distribution, which limits the applicability of those algorithms in practice when these assumptions are hard to verify. In this work, we study two more agnostic models: (a) a "linear policy" problem where we aim at competing with the best linear pricing policy while making no assumptions on the data, and (b) a "linear noisy valuation" problem where the random valuation is linear plus an unknown and assumption-free noise. For the former model, we show a $\tilde{\Theta}(d^{\frac13}T^{\frac23})$ minimax regret up to logarithmic factors. For the latter model, we present an algorithm that achieves an $\tilde{O}(T^{\frac34})$ regret, and improve the best-known lower bound from $\Omega(T^{\frac35})$ to $\tilde{\Omega}(T^{\frac23})$. These results demonstrate that no-regret learning is possible for feature-based dynamic pricing under weak assumptions, but also reveal a disappointing fact that the seemingly richer pricing feedback is not significantly more useful than the bandit-feedback in regret reduction.
    Change Detection of Markov Kernels with Unknown Post Change Kernel using Maximum Mean Discrepancy. (arXiv:2201.11722v1 [eess.SY])
    (2 min) In this paper, we develop a new change detection algorithm for detecting a change in the Markov kernel over a metric space in which the post-change kernel is unknown. Under the assumption that the pre- and post-change Markov kernel is geometrically ergodic, we derive an upper bound on the mean delay and a lower bound on the mean time between false alarms.
    Optimal transport weights for causal inference. (arXiv:2109.01991v3 [stat.ME] UPDATED)
    (2 min) Weighting methods are a common tool to de-bias estimates of causal effects. And though there are an increasing number of seemingly disparate methods, many of them can be folded into one unifying regime: Causal Optimal Transport. This new method directly targets distributional balance by minimizing optimal transport distances between treatment and control groups or, more generally, between a source and target population. Our approach is semiparametrically efficient and model-free but can also incorporate moments or any other important functions of covariates that the researcher desires to balance. We find that Causal Optimal Transport outperforms competitor methods when both the propensity score and outcome models are misspecified, indicating it is a robust alternative to common weighting methods. Finally, we demonstrate the utility of our method in an external control study examining the effect of misoprostol versus oxytocin for the treatment of post-partum hemorrhage.
    First-Order Regret in Reinforcement Learning with Linear Function Approximation: A Robust Estimation Approach. (arXiv:2112.03432v2 [cs.LG] UPDATED)
    (2 min) Obtaining first-order regret bounds -- regret bounds scaling not as the worst-case but with some measure of the performance of the optimal policy on a given instance -- is a core question in sequential decision-making. While such bounds exist in many settings, they have proven elusive in reinforcement learning with large state spaces. In this work we address this gap, and show that it is possible to obtain regret scaling as $\mathcal{O}(\sqrt{V_1^\star K})$ in reinforcement learning with large state spaces, namely the linear MDP setting. Here $V_1^\star$ is the value of the optimal policy and $K$ is the number of episodes. We demonstrate that existing techniques based on least squares estimation are insufficient to obtain this result, and instead develop a novel robust self-normalized concentration bound based on the robust Catoni mean estimator, which may be of independent interest.
    Likelihood Contribution based Multi-scale Architecture for Generative Flows. (arXiv:1908.01686v3 [cs.LG] UPDATED)
    (2 min) Deep generative modeling using flows has gained popularity owing to the tractable exact log-likelihood estimation with efficient training and synthesis process. However, flow models suffer from the challenge of having high dimensional latent space, the same in dimension as the input space. An effective solution to the above challenge as proposed by Dinh et al. (2016) is a multi-scale architecture, which is based on iterative early factorization of a part of the total dimensions at regular intervals. Prior works on generative flow models involving a multi-scale architecture perform the dimension factorization based on static masking. We propose a novel multi-scale architecture that performs data-dependent factorization to decide which dimensions should pass through more flow layers. To facilitate the same, we introduce a heuristic based on the contribution of each dimension to the total log-likelihood which encodes the importance of the dimensions. Our proposed heuristic is readily obtained as part of the flow training process, enabling the versatile implementation of our likelihood contribution based multi-scale architecture for generic flow models. We present such implementations for several state-of-the-art flow models and demonstrate improvements in log-likelihood score and sampling quality on standard image benchmarks. We also conduct ablation studies to compare the proposed method with other options for dimension factorization.
    Improved Complexities for Stochastic Conditional Gradient Methods under Interpolation-like Conditions. (arXiv:2006.08167v2 [math.OC] UPDATED)
    (2 min) We analyze stochastic conditional gradient methods for constrained optimization problems arising in over-parametrized machine learning. We show that one could leverage the interpolation-like conditions satisfied by such models to obtain improved oracle complexities. Specifically, when the objective function is convex, we show that the conditional gradient method requires $\mathcal{O}(\epsilon^{-2})$ calls to the stochastic gradient oracle to find an $\epsilon$-optimal solution. Furthermore, by including a gradient sliding step, we show that the number of calls reduces to $\mathcal{O}(\epsilon^{-1.5})$.
    The Implicit Bias of Benign Overfitting. (arXiv:2201.11489v1 [cs.LG])
    (2 min) The phenomenon of benign overfitting, where a predictor perfectly fits noisy training data while attaining low expected loss, has received much attention in recent years, but still remains not fully understood beyond simple linear regression setups. In this paper, we show that for regression, benign overfitting is "biased" towards certain types of problems, in the sense that its existence on one learning problem excludes its existence on other learning problems. On the negative side, we use this to argue that one should not expect benign overfitting to occur in general, for several natural extensions of the plain linear regression problems studied so far. We then turn to classification problems, and show that the situation there is much more favorable. Specifically, we consider a model where an arbitrary input distribution of some fixed dimension $k$ is concatenated with a high-dimensional distribution, and prove that the max-margin predictor (to which gradient-based methods are known to converge in direction) is asymptotically biased towards minimizing the expected *squared hinge loss* w.r.t. the $k$-dimensional distribution. This allows us to reduce the question of benign overfitting in classification to the simpler question of whether this loss is a good surrogate for the prediction error, and use it to show benign overfitting in some new settings.
    Online Change Point Detection for Weighted and Directed Random Dot Product Graphs. (arXiv:2201.11222v1 [cs.LG])
    (2 min) Given a sequence of random (directed and weighted) graphs, we address the problem of online monitoring and detection of changes in the underlying data distribution. Our idea is to endow sequential change-point detection (CPD) techniques with a graph representation learning substrate based on the versatile Random Dot Product Graph (RDPG) model. We consider efficient, online updates of a judicious monitoring function, which quantifies the discrepancy between the streaming graph observations and the nominal RDPG. This reference distribution is inferred via spectral embeddings of the first few graphs in the sequence. We characterize the distribution of this running statistic to select thresholds that guarantee error-rate control, and under simplifying approximations we offer insights on the algorithm's detection resolution and delay. The end result is a lightweight online CPD algorithm, that is also explainable by virtue of the well-appreciated interpretability of RDPG embeddings. This is in stark contrast with most existing graph CPD approaches, which either rely on extensive computation, or they store and process the entire observed time series. An apparent limitation of the RDPG model is its suitability for undirected and unweighted graphs only, a gap we aim to close here to broaden the scope of the CPD framework. Unlike previous proposals, our non-parametric RDPG model for weighted graphs does not require a priori specification of the weights' distribution to perform inference and estimation. This network modeling contribution is of independent interest beyond CPD. We offer an open-source implementation of the novel online CPD algorithm for weighted and direct graphs, whose effectiveness and efficiency are demonstrated via (reproducible) synthetic and real network data experiments.
    Logic of Machine Learning. (arXiv:2006.09500v4 [cs.LG] UPDATED)
    (2 min) The main question is: why and how can we ever predict based on a finite sample? The question is not answered by statistical learning theory. Here, I suggest that prediction requires belief in "predictability" of the underlying dependence, and learning involves search for a hypothesis where these beliefs are violated the least given the observations. The measure of these violations ("errors") for given data, hypothesis and particular type of predictability beliefs is formalized as concept of incongruity in modal Logic of Observations and Hypotheses (LOH). I show on examples of many popular textbook learners (from hierarchical clustering to k-NN and SVM) that each of them minimizes its own version of incongruity. In addition, the concept of incongruity is shown to be flexible enough for formalization of some important data analysis problems, not considered as part of ML.
    Generalized Matrix Factorization: efficient algorithms for fitting generalized linear latent variable models to large data arrays. (arXiv:2010.02469v3 [cs.LG] UPDATED)
    (2 min) Unmeasured or latent variables are often the cause of correlations between multivariate measurements, which are studied in a variety of fields such as psychology, ecology, and medicine. For Gaussian measurements, there are classical tools such as factor analysis or principal component analysis with a well-established theory and fast algorithms. Generalized Linear Latent Variable models (GLLVMs) generalize such factor models to non-Gaussian responses. However, current algorithms for estimating model parameters in GLLVMs require intensive computation and do not scale to large datasets with thousands of observational units or responses. In this article, we propose a new approach for fitting GLLVMs to high-dimensional datasets, based on approximating the model using penalized quasi-likelihood and then using a Newton method and Fisher scoring to learn the model parameters. Computationally, our method is noticeably faster and more stable, enabling GLLVM fits to much larger matrices than previously possible. We apply our method on a dataset of 48,000 observational units with over 2,000 observed species in each unit and find that most of the variability can be explained with a handful of factors. We publish an easy-to-use implementation of our proposed fitting algorithm.
    Implicit Regularization in Hierarchical Tensor Factorization and Deep Convolutional Neural Networks. (arXiv:2201.11729v1 [cs.LG])
    (2 min) In the pursuit of explaining implicit regularization in deep learning, prominent focus was given to matrix and tensor factorizations, which correspond to simplified neural networks. It was shown that these models exhibit implicit regularization towards low matrix and tensor ranks, respectively. Drawing closer to practical deep learning, the current paper theoretically analyzes the implicit regularization in hierarchical tensor factorization, a model equivalent to certain deep convolutional neural networks. Through a dynamical systems lens, we overcome challenges associated with hierarchy, and establish implicit regularization towards low hierarchical tensor rank. This translates to an implicit regularization towards locality for the associated convolutional networks. Inspired by our theory, we design explicit regularization discouraging locality, and demonstrate its ability to improve performance of modern convolutional networks on non-local tasks, in defiance of conventional wisdom by which architectural changes are needed. Our work highlights the potential of enhancing neural networks via theoretical analysis of their implicit regularization.
    Exploring the Geometry and Topology of Neural Network Loss Landscapes. (arXiv:2102.00485v2 [cs.LG] UPDATED)
    (2 min) Recent work has established clear links between the generalization performance of trained neural networks and the geometry of their loss landscape near the local minima to which they converge. This suggests that qualitative and quantitative examination of the loss landscape geometry could yield insights about neural network generalization performance during training. To this end, researchers have proposed visualizing the loss landscape through the use of simple dimensionality reduction techniques. However, such visualization methods have been limited by their linear nature and only capture features in one or two dimensions, thus restricting sampling of the loss landscape to lines or planes. Here, we expand and improve upon these in three ways. First, we present a novel "jump and retrain" procedure for sampling relevant portions of the loss landscape. We show that the resulting sampled data holds more meaningful information about the network's ability to generalize. Next, we show that non-linear dimensionality reduction of the jump and retrain trajectories via PHATE, a trajectory and manifold-preserving method, allows us to visualize differences between networks that are generalizing well vs poorly. Finally, we combine PHATE trajectories with a computational homology characterization to quantify trajectory differences.
    Critical Initialization of Wide and Deep Neural Networks through Partial Jacobians: General Theory and Applications. (arXiv:2111.12143v3 [cs.LG] UPDATED)
    (2 min) Deep neural networks are notorious for defying theoretical treatment. However, when the number of parameters in each layer tends to infinity the network function is a Gaussian process (GP) and quantitatively predictive description is possible. Gaussian approximation allows to formulate criteria for selecting hyperparameters, such as variances of weights and biases, as well as the learning rate. These criteria rely on the notion of criticality defined for deep neural networks. In this work we describe a new practical way to diagnose criticality. We introduce \emph{partial Jacobians} of a network, defined as derivatives of preactivations in layer $l$ with respect to preactivations in layer $l_0\leq l$. We derive recurrence relations for the norms of partial Jacobians and utilize these relations to analyze criticality of deep fully connected neural networks with LayerNorm and/or residual connections. We derive and implement a simple and cheap numerical test that allows to select optimal initialization for a broad class of deep neural networks. Using these tools we show quantitatively that proper stacking of the LayerNorm (applied to preactivations) and residual connections leads to an architecture that is critical for any initialization. Finally, we apply our methods to analyze the MLP-Mixer architecture and show that it is everywhere critical.
    Advanced Stationary and Non-Stationary Kernel Designs for Domain-Aware Gaussian Processes. (arXiv:2102.03432v3 [stat.ML] UPDATED)
    (2 min) Gaussian process regression is a widely-applied method for function approximation and uncertainty quantification. The technique has gained popularity recently in the machine learning community due to its robustness and interpretability. The mathematical methods we discuss in this paper are an extension of the Gaussian-process framework. We are proposing advanced kernel designs that only allow for functions with certain desirable characteristics to be elements of the reproducing kernel Hilbert space (RKHS) that underlies all kernel methods and serves as the sample space for Gaussian process regression. These desirable characteristics reflect the underlying physics; two obvious examples are symmetry and periodicity constraints. In addition, non-stationary kernel designs can be defined in the same framework to yield flexible multi-task Gaussian processes. We will show the impact of advanced kernel designs on Gaussian processes using several synthetic and two scientific data sets. The results show that including domain knowledge, communicated through advanced kernel designs, has a significant impact on the accuracy and relevance of the function approximation.
    Soft BIBD and Product Gradient Codes. (arXiv:2105.05231v2 [cs.IT] UPDATED)
    (2 min) Gradient coding is a coding theoretic framework to provide robustness against slow or unresponsive machines, known as stragglers, in distributed machine learning applications. Recently, Kadhe et al. proposed a gradient code based on a combinatorial design, called balanced incomplete block design (BIBD), which is shown to outperform many existing gradient codes in worst-case adversarial straggling scenarios. However, parameters for which such BIBD constructions exist are very limited. In this paper, we aim to overcome such limitations and construct gradient codes which exist for a wide range of system parameters while retaining the superior performance of BIBD gradient codes. Two such constructions are proposed, one based on a probabilistic construction that relax the stringent BIBD gradient code constraints, and the other based on taking the Kronecker product of existing gradient codes. The proposed gradient codes allow flexible choices of system parameters while retaining comparable error performance.
    Benchmarking learned non-Cartesian k-space trajectories and reconstruction networks. (arXiv:2201.11356v1 [cs.LG])
    (2 min) We benchmark the current existing methods to jointly learn non-Cartesian k-space trajectory and reconstruction: PILOT, BJORK, and compare them with those obtained from the recently developed generalized hybrid learning (HybLearn) framework. We present the advantages of using projected gradient descent to enforce MR scanner hardware constraints as compared to using added penalties in the cost function. Further, we use the novel HybLearn scheme to jointly learn and compare our results through a retrospective study on fastMRI validation dataset.
    Reward-Free RL is No Harder Than Reward-Aware RL in Linear Markov Decision Processes. (arXiv:2201.11206v1 [cs.LG])
    (2 min) Reward-free reinforcement learning (RL) considers the setting where the agent does not have access to a reward function during exploration, but must propose a near-optimal policy for an arbitrary reward function revealed only after exploring. In the the tabular setting, it is well known that this is a more difficult problem than PAC RL -- where the agent has access to the reward function during exploration -- with optimal sample complexities in the two settings differing by a factor of $|\mathcal{S}|$, the size of the state space. We show that this separation does not exist in the setting of linear MDPs. We first develop a computationally efficient algorithm for reward-free RL in a $d$-dimensional linear MDP with sample complexity scaling as $\mathcal{O}(d^2/\epsilon^2)$. We then show a matching lower bound of $\Omega(d^2/\epsilon^2)$ on PAC RL. To our knowledge, our approach is the first computationally efficient algorithm to achieve optimal $d$ dependence in linear MDPs, even in the single-reward PAC setting. Our algorithm relies on a novel procedure which efficiently traverses a linear MDP, collecting samples in any given "feature direction", and enjoys a sample complexity scaling optimally in the (linear MDP equivalent of the) maximal state visitation probability. We show that this exploration procedure can also be applied to solve the problem of obtaining "well-conditioned" covariates in linear MDPs.
    Learning Mixtures of Linear Dynamical Systems. (arXiv:2201.11211v1 [stat.ML])
    (2 min) We study the problem of learning a mixture of multiple linear dynamical systems (LDSs) from unlabeled short sample trajectories, each generated by one of the LDS models. Despite the wide applicability of mixture models for time-series data, learning algorithms that come with end-to-end performance guarantees are largely absent from existing literature. There are multiple sources of technical challenges, including but not limited to (1) the presence of latent variables (i.e. the unknown labels of trajectories); (2) the possibility that the sample trajectories might have lengths much smaller than the dimension $d$ of the LDS models; and (3) the complicated temporal dependence inherent to time-series data. To tackle these challenges, we develop a two-stage meta-algorithm, which is guaranteed to efficiently recover each ground-truth LDS model up to error $\tilde{O}(\sqrt{d/T})$, where $T$ is the total sample size. We validate our theoretical studies with numerical experiments, confirming the efficacy of the proposed algorithm.
    Fairness implications of encoding protected categorical attributes. (arXiv:2201.11358v1 [cs.LG])
    (2 min) Protected attributes are often presented as categorical features that need to be encoded before feeding them into a machine learning algorithm. Encoding these attributes is paramount as they determine the way the algorithm will learn from the data. Categorical feature encoding has a direct impact on the model performance and fairness. In this work, we compare the accuracy and fairness implications of the two most well-known encoders: one-hot encoding and target encoding. We distinguish between two types of induced bias that can arise while using these encodings and can lead to unfair models. The first type, irreducible bias, is due to direct group category discrimination and a second type, reducible bias, is due to large variance in less statistically represented groups. We take a deeper look into how regularization methods for target encoding can improve the induced bias while encoding categorical features. Furthermore, we tackle the problem of intersectional fairness that arises when mixing two protected categorical features leading to higher cardinality. This practice is a powerful feature engineering technique used for boosting model performance. We study its implications on fairness as it can increase both types of induced bias
    MQTransformer: Multi-Horizon Forecasts with Context Dependent and Feedback-Aware Attention. (arXiv:2009.14799v4 [cs.LG] UPDATED)
    (0 min) Recent advances in neural forecasting have produced major improvements in accuracy for probabilistic demand prediction. In this work, we propose novel improvements to the current state of the art by incorporating changes inspired by recent advances in Transformer architectures for Natural Language Processing. We develop a novel decoder-encoder attention for context-alignment, improving forecasting accuracy by allowing the network to study its own history based on the context for which it is producing a forecast. We also present a novel positional encoding that allows the neural network to learn context-dependent seasonality functions as well as arbitrary holiday distances. Finally we show that the current state of the art MQ-Forecaster (Wen et al., 2017) models display excess variability by failing to leverage previous errors in the forecast to improve accuracy. We propose a novel decoder-self attention scheme for forecasting that produces significant improvements in the excess variation of the forecast.
    Can we Generalize and Distribute Private Representation Learning?. (arXiv:2010.01792v4 [cs.LG] UPDATED)
    (0 min) We study the problem of learning representations that are private yet informative i.e., provide information about intended "ally" targets while hiding sensitive "adversary" attributes). We propose Exclusion-Inclusion Generative Adversarial Network (EIGAN), a generalized private representation learning (PRL) architecture that accounts for multiple ally and adversary attributes unlike existing PRL solutions. While centrally-aggregated dataset is a prerequisite for most PRL techniques, data in real-world is often siloed across multiple distributed nodes unwilling to share the raw data because of privacy concerns. We address this practical constraint by developing D-EIGAN, the first distributed PRL method that learns representations at each node without transmitting the source data. We theoretically analyze the behavior of adversaries under the optimal EIGAN and D-EIGAN encoders and the impact of dependencies among ally and adversary tasks on the optimization objective. Our experiments on various datasets demonstrate the advantages of EIGAN in terms of performance, robustness, and scalability. In particular, EIGAN outperforms the previous state-of-the-art by a significant accuracy margin (47% improvement), and D-EIGAN's performance is consistently on par with EIGAN under different network settings.
    Multi-view learning with privileged weighted twin support vector machine. (arXiv:2201.11306v1 [stat.ML])
    (0 min) Weighted twin support vector machines (WLTSVM) mines as much potential similarity information in samples as possible to improve the common short-coming of non-parallel plane classifiers. Compared with twin support vector machines (TWSVM), it reduces the time complexity by deleting the superfluous constraints using the inter-class K-Nearest Neighbor (KNN). Multi-view learning (MVL) is a newly developing direction of machine learning, which focuses on learning acquiring information from the data indicated by multiple feature sets. In this paper, we propose multi-view learning with privileged weighted twin support vector machines (MPWTSVM). It not only inherits the advantages of WLTSVM but also has its characteristics. Firstly, it enhances generalization ability by mining intra-class information from the same perspective. Secondly, it reduces the redundancy constraints with the help of inter-class information, thus improving the running speed. Most importantly, it can follow both the consensus and the complementarity principle simultaneously as a multi-view classification model. The consensus principle is realized by minimizing the coupling items of the two views in the original objective function. The complementary principle is achieved by establishing privileged information paradigms and MVL. A standard quadratic programming solver is used to solve the problem. Compared with multi-view classification models such as SVM-2K, MVTSVM, MCPK, and PSVM-2V, our model has better accuracy and classification efficiency. Experimental results on 45 binary data sets prove the effectiveness of our method.
    Born-Infeld (BI) for AI: Energy-Conserving Descent (ECD) for Optimization. (arXiv:2201.11137v1 [cs.LG])
    (2 min) We introduce a novel framework for optimization based on energy-conserving Hamiltonian dynamics in a strongly mixing (chaotic) regime and establish its key properties analytically and numerically. The prototype is a discretization of Born-Infeld dynamics, with a squared relativistic speed limit depending on the objective function. This class of frictionless, energy-conserving optimizers proceeds unobstructed until slowing naturally near the minimal loss, which dominates the phase space volume of the system. Building from studies of chaotic systems such as dynamical billiards, we formulate a specific algorithm with good performance on machine learning and PDE-solving tasks, including generalization. It cannot stop at a high local minimum and cannot overshoot the global minimum, yielding an advantage in non-convex loss functions, and proceeds faster than GD+momentum in shallow valleys.
    Self-Certifying Classification by Linearized Deep Assignment. (arXiv:2201.11162v1 [stat.ML])
    (2 min) We propose a novel class of deep stochastic predictors for classifying metric data on graphs within the PAC-Bayes risk certification paradigm. Classifiers are realized as linearly parametrized deep assignment flows with random initial conditions. Building on the recent PAC-Bayes literature and data-dependent priors, this approach enables (i) to use risk bounds as training objectives for learning posterior distributions on the hypothesis space and (ii) to compute tight out-of-sample risk certificates of randomized classifiers more efficiently than related work. Comparison with empirical test set errors illustrates the performance and practicality of this self-certifying classification method.
  • Data Science Central

    How Solid Pods May End Up Becoming the Building Blocks of the Metaverse
    (12 min) Tim Berners-Lee has an interesting habit of coming up with ideas that seem hard to explain at the outset, remain all hard to understand even as they become more implemented and refined, can go for years with only a few die-hard fans becoming convinced that what he is doing is the best thing since sliced… Read More »How Solid Pods May End Up Becoming the Building Blocks of the Metaverse The post How Solid Pods May End Up Becoming the Building Blocks of the Metaverse appeared first on Data Science Central.
  • Artificial Intelligence

    Artificial Nightmares : Forgive Me Tyrael || DD PYTTI VQGAN AI Art Video [4K 60 FPS]
    (0 min) submitted by /u/Thenamessd [link] [comments]
    Can you think of a field AI won't impact?
    (1 min) I was discussing with a friend how AI will change every single aspects of our lives and realized that I was not able to come up with even a single field that will not be impacted by AI. Do you have any idea? submitted by /u/Krellb [link] [comments]
    2020s. A short story by GPT-3
    (1 min) (Prompt - Generated Text) Make up a short story/anecdote/poem or whatever in which you talk about the world of the 2020s - and whether there is still hope. 2020s The world was once a beautiful and vibrant place, but now it is just a shadow of its former self. People are lethargic and apathetic, and the only thing that seems to bring them joy is meaningless entertainment and distraction. The forests are empty and the animals have disappeared. The rivers and lakes are polluted and the air is poisonous. The only things that seem to thrive are disease and poverty. And it looks like everything is only going to get worse. People are so immersed in their own misery that they don't even realize it's getting worse. They are content to sit in their own filth and let the world die around them. The only thing that seems to make them happy is the false hope that things will get better. But they don't. And people are getting sadder and more lethargic. But there is still hope. There is still a chance for people to change things. They just need to open their eyes and see the world for what it is. They need to realize that they are the ones who can make a difference. And when they finally do, they will transform the world into a place of beauty and happiness. What is the conclusion of this story? The conclusion is that whether on the planet or in your very own mental world, you need to take action. Lethargy means stagnation and a creeping but certain death. submitted by /u/BeginningInfluence55 [link] [comments]
    What are Hypernetworks where AI Builds AI?
    (0 min) submitted by /u/Beautiful-Credit-868 [link] [comments]
    Realistic AI Face Editing in Videos ! GAN-based face manipulations in videos: Stitch it in Time explained
    (1 min) submitted by /u/OnlyProggingForFun [link] [comments]
    The ___________ approach uses the knowledge of mathematics and engineering.
    (0 min) submitted by /u/futureanalytica [link] [comments]
    In what areas is China leading the AI race (if any) Also have they come up with anything original themselves or have they merely improved on what the west is already doing?
    (2 min) ​ In what areas is China leading the AI race (if any) Also have they come up with anything original themselves or have they merely improved on what the west is already doing? ​ Or are they currently behind in every area/most areas? submitted by /u/Fun_Sun_97 [link] [comments]
    Analysis of new advanced enemy AI game design position opening up for the upcoming Riot MMO
    (2 min) The principal game designer for the Riot MMO just posted a new position for developing an enemy AI system. https://twitter.com/Ghostcrawler retweets: Are you a highly experienced AI designer? Are you a subject matter expert at building or contributing to the foundations of an enemy AI system? Do the words "Behavior Tree" excite you? Let's chat :) According to the job description When you take decision trees and AI into account, the two things that stand out the most to me are: Experience designing friendly or non combat AI This "friendly/non combat AI" aspect makes it seem like it will be heavily story based with decisions affecting what happens in regards to your progression or access to new areas/regions/resources/gear. (Kind of like Detroit become human). and: Expe…
    Artificial Intelligence Could Help Detect Onset of Cardiovascular Disease
    (0 min) submitted by /u/Beautiful-Credit-868 [link] [comments]
  • Machine Learning

    Why are ‘tree based’ models discussed less in optimization/statistical ML contexts than general data science contexts? [D]
    (2 min) None of the statistics or statistical ML classes I’ve taken have given significant lip service to tree based models, but in the in 1 DS boot camp I did, they were mentioned extensively. Is there something to this anecdote, or am I just extrapolating from to small a sample? Why might classically trained statisticians and ‘ ML research ‘ people be less interested in these models than data scientists? submitted by /u/Greenface1998 [link] [comments]
    Stylegan-nada generated filters using only javascript https://filter.mot.omg.lol [P]
    (0 min) submitted by /u/FamiliarGift2237 [link] [comments]
    [D] First Author Interview - Deep Symbolic Regression for Recurrent Sequences (Paper Explained Video & Interview)
    (1 min) https://youtu.be/1HEdXwEYrGM This video includes an interview with first author Stéphane d'Ascoli (https://sdascoli.github.io/). Deep neural networks are typically excellent at numeric regression, but using them for symbolic computation has largely been ignored so far. This paper uses transformers to do symbolic regression on integer and floating point number sequences, which means that given the start of a sequence of numbers, the model has to not only predict the correct continuation, but also predict the data generating formula behind the sequence. Through clever encoding of the input space and a well constructed training data generation process, this paper's model can learn and represent many of the sequences in the OEIS, the online encyclopedia of integer sequences and it also features an interactive demo if you want to try it by yourself. ​ OUTLINE: 0:00 - Introduction 2:20 - Summary of the Paper 16:10 - Start of Interview 17:15 - Why this research direction? 20:45 - Overview of the method 30:10 - Embedding space of input tokens 33:00 - Data generation process 42:40 - Why are transformers useful here? 46:40 - Beyond number sequences, where is this useful? 48:45 - Success cases and failure cases 58:10 - Experimental Results 1:06:30 - How did you overcome difficulties? 1:09:25 - Interactive demo ​ Paper: https://arxiv.org/abs/2201.04600 Interactive demo: http://recur-env.eba-rm3fchmn.us-east-2.elasticbeanstalk.com/ submitted by /u/ykilcher [link] [comments]
    [R][N] Camera Trap photo analysis at the Chinko Nature Reserve using Convolutional Neural Networks
    (0 min) submitted by /u/Danny_Arends [link] [comments]
    [R] PyTorch Implementation of the Natural Posterior Network
    (1 min) While I am sure that you will find our spotlight paper "Natural Posterior Network: Deep Bayesian Uncertainty for Exponential Family Distributions" at ICLR 2022, I want to take this opportunity to highlight our publicly available PyTorch implementation. Natural Posterior Network (NatPN) provides fast and high-quality uncertainty estimates and can be used for any problem where the target distribution belongs to the exponential family. Most notably, this includes classification, regression, and count prediction tasks. Importantly, NatPN requires little changes to existing architectures and produces uncertainty estimates without the need for out-of-distribution data. Thus, chances are good that you can easily extend your standard deep learning models with the NatPN architecture. Therefore, we put serious effort in the publicly available implementation to facilitate usage of NatPN: we (1) provide an intuitive interface that enables using the model as easily as Scikit-learn estimators and (2) follow a modular design that allows you to customize and build upon the model at different levels of abstraction. Check it out on GitHub! submitted by /u/borchero [link] [comments]
    [D] What are some graph data augmentations?
    (1 min) Hello! For those of you who do Graph Deep Learning or inference on graphs, what are some good data augmentations you use on your graph data? So far, there's only one paper (https://arxiv.org/pdf/2006.06830.pdf) that documents something like this and it mentions stuff like dropping nodes, dropping edges, perturbing features, etc. What are some others you've come across / practice in your work? And do your libraries (like GDL, PyG, etc.) come with such augmentations in the form of methods or do you have to write a bit of extra code? I'm performing some experiments the next couple days and was hoping to perform data augmentations on the data to improve training, as is done with regular images. Appreciate your help! submitted by /u/UniqueCut5181 [link] [comments]
    [P] Autonomous Detection and Deterrence of Pigeons on Buildings by Drones
    (0 min) submitted by /u/Tintin_Quarentino [link] [comments]
    [D] Creating fixed length embedding for variable length sequences?
    (2 min) Task: node classification of graphs where nodes features are sentences of varying length. In what ways can I transform these sentences into a fixed length vector? submitted by /u/l34df4rm3r [link] [comments]
    [P] WebtoonMe Project: Selfie to Webtoon style
    (2 min) submitted by /u/Illustrious_Row_9971 [link] [comments]
    [D] Bulk training set converter?
    (1 min) Is there any bulk image tool so that I can convert my training set for StyleGAN2 all to rgb format? I have been looking all over for a tool like this, if someone knows of any I think it would help a lot of people. :) submitted by /u/Rejected_Crafter [link] [comments]
    [D] Training Neural networks on AWS EC2
    (1 min) Can anyone share a workflow for setting training neural nets on AWS. I have already created my AMI for the instance I will be using. I am just looking for some additional info and help on cloning repos and then running main scripts and if I want to save checkpoints what directory should I put them in so I can reinitialize a model. I would preferably avoid using Docker if at all possible but if that is the only way if someone has a good resource on this that would be great. Any help is really appreciated thanks! submitted by /u/AbjectDrink3276 [link] [comments]
  • Neural Networks, Deep Learning and Machine Learning

    Researchers Introduce ‘SimpleBits’: An Information-Reduction Strategy That Learns To Synthesize Simplified Inputs For Neural Network Understanding
    (1 min) The practice of identifying and labeling groups of pixels or vectors inside an image based on specific rules is known as image classification. One or more spectral or textural properties can be used to create the classification law. Scientists believe that a deeper knowledge of how deep networks learn information can lead to new scientific discoveries, help us better comprehend the distinctions between human and model behavior, and serve as valuable auditing tools. According to studies, it is essential to understand how neural network-based image classifiers would react to increasingly simple inputs. For this, a clear measure of input simplicity, an optimization goal that correlates with simplification, and a framework to incorporate such a goal into training and inference are required. Continue Reading Paper: https://arxiv.org/pdf/2201.05610v1.pdf https://preview.redd.it/9xcmnh9grje81.png?width=1334&format=png&auto=webp&s=a18753ee5ef620b248db6d7d622acc23e39b2695 submitted by /u/ai-lover [link] [comments]
  • Reinforcement Learning

    Finding Classic MARL Algo Implementations
    (1 min) submitted by /u/rower22 [link] [comments]
    Looking for an online internship program on reinforcement learning
    (1 min) Greetings everyone, this is my first post here. I'm a self learner on reinforcement learning. I'm really searching for a good online platform that renders internship program on reinforcment learner so that I can really enhance my skill and visualize it in its real form. I will be pleased if anyone can point me in the right direction. Thanks submitted by /u/StephenAnyanwu [link] [comments]
    Gym environments using Godot engine
    (1 min) Hello, there are already Unreal and Unity engines bindings for python. I just made my own for Godot engine: https://github.com/lupoglaz/GodotAIGym And also recorded tutorials on how to code LunarLander environment and train an agent: https://youtu.be/qolRDx1Q_TQ - Installation https://youtu.be/ahtcw8hzHSQ - Basic environment https://youtu.be/fvqxjF2KsTM - Agent https://youtu.be/ADpm1v2Jlqo - Ground https://youtu.be/uqqGctJrqWw - Observation and reward https://youtu.be/6IgR34Ys0io - Training AI and deploying it Why Godot Engine? First, it's extremely lightweight, just under 100mb executable for the editor. Second, no external dependencies, Linux first code, so deploying your environment on a SLURM cluster is much less painful, then for Unity or Unreal based environments. Next, it's rather easy to execute the engine faster than real time, easy to turn off the renderer at the compile time, it has great profiler tools. Hope you'll find this side-project of mine handy. submitted by /u/clueless_scientist [link] [comments]

2022-01-28

  • cs.LG updates on arXiv.org

    Recency Dropout for Recurrent Recommender Systems. (arXiv:2201.11016v1 [cs.IR])
    (2 min) Recurrent recommender systems have been successful in capturing the temporal dynamics in users' activity trajectories. However, recurrent neural networks (RNNs) are known to have difficulty learning long-term dependencies. As a consequence, RNN-based recommender systems tend to overly focus on short-term user interests. This is referred to as the recency bias, which could negatively affect the long-term user experience as well as the health of the ecosystem. In this paper, we introduce the recency dropout technique, a simple yet effective data augmentation technique to alleviate the recency bias in recurrent recommender systems. We demonstrate the effectiveness of recency dropout in various experimental settings including a simulation study, offline experiments, as well as live experiments on a large-scale industrial recommendation platform.
    CsFEVER and CTKFacts: Czech Datasets for Fact Verification. (arXiv:2201.11115v1 [cs.CL])
    (2 min) In this paper we present two Czech datasets aimed for training automated fact-checking machine learning models. Specifically we deal with the task of assessment of a textual claim veracity w.r.t. to a (presumably) verified corpus. The output of the system is the claim classification SUPPORTS or REFUTES complemented with evidence documents or NEI (Not Enough Info) alone. In the first place we publish CsFEVER of approximately 112k claims which is an automatically generated Czech version of the well-known Wikipedia-based FEVER dataset. We took a hybrid approach of machine translation and language alignment, where the same method (and tools we provide) can be easily applied to other languages. The second dataset CTKFacts of 3,097 claims is built on the corpus of approximately two million Czech News Agency news reports. We present an extended methodology based on the FEVER approach. Most notably, we describe a method to automatically generate wider claim contexts (dictionaries) for non-hyperlinked corpora. The datasets are analyzed for spurious cues, which are annotation patterns leading to model overfitting. CTKFacts is further examined for inter-annotator agreement, and a typology of common annotator errors is extracted. Finally, we provide baseline models for all stages of the fact-checking pipeline.
    The edge of chaos: quantum field theory and deep neural networks. (arXiv:2109.13247v2 [hep-th] UPDATED)
    (2 min) We explicitly construct the quantum field theory corresponding to a general class of deep neural networks encompassing both recurrent and feedforward architectures. We first consider the mean-field theory (MFT) obtained as the leading saddlepoint in the action, and derive the condition for criticality via the largest Lyapunov exponent. We then compute the loop corrections to the correlation function in a perturbative expansion in the ratio of depth $T$ to width $N$, and find a precise analogy with the well-studied $O(N)$ vector model, in which the variance of the weight initializations plays the role of the 't Hooft coupling. In particular, we compute both the $\mathcal{O}(1)$ corrections quantifying fluctuations from typicality in the ensemble of networks, and the subleading $\mathcal{O}(T/N)$ corrections due to finite-width effects. These provide corrections to the correlation length that controls the depth to which information can propagate through the network, and thereby sets the scale at which such networks are trainable by gradient descent. Our analysis provides a first-principles approach to the rapidly emerging NN-QFT correspondence, and opens several interesting avenues to the study of criticality in deep neural networks.
    DSFormer: A Dual-domain Self-supervised Transformer for Accelerated Multi-contrast MRI Reconstruction. (arXiv:2201.10776v1 [eess.IV])
    (2 min) Multi-contrast MRI (MC-MRI) captures multiple complementary imaging modalities to aid in radiological decision-making. Given the need for lowering the time cost of multiple acquisitions, current deep accelerated MRI reconstruction networks focus on exploiting the redundancy between multiple contrasts. However, existing works are largely supervised with paired data and/or prohibitively expensive fully-sampled MRI sequences. Further, reconstruction networks typically rely on convolutional architectures which are limited in their capacity to model long-range interactions and may lead to suboptimal recovery of fine anatomical detail. To these ends, we present a dual-domain self-supervised transformer (DSFormer) for accelerated MC-MRI reconstruction. DSFormer develops a deep conditional cascade transformer (DCCT) consisting of several cascaded Swin transformer reconstruction networks (SwinRN) trained under two deep conditioning strategies to enable MC-MRI information sharing. We further present a dual-domain (image and k-space) self-supervised learning strategy for DCCT to alleviate the costs of acquiring fully sampled training data. DSFormer generates high-fidelity reconstructions which experimentally outperform current fully-supervised baselines. Moreover, we find that DSFormer achieves nearly the same performance when trained either with full supervision or with our proposed dual-domain self-supervision.
    Privacy-Preserving Logistic Regression Training with a Faster Gradient Variant. (arXiv:2201.10838v1 [cs.CR])
    (2 min) Logistic regression training on an encrypted dataset has been an attractive idea to security concerns for years. In this paper, we propose a faster gradient variant called Quadratic Gradient for logistic regression and implement it via a special homomorphic encryption scheme. The core of this gradient variant can be seen as an extension of the simplified fixed Hessian from Newton's method, which extracts information from the Hessian matrix into the naive gradient, and thus can be used to enhance Nesterov's accelerated gradient (NAG), Adagrad, etc. We evaluate various gradient $ascent$ methods with this gradient variant on the gene dataset provided by the 2017 iDASH competition and the image dataset from the MNIST database. Experimental results show that the enhanced methods converge faster and sometimes even to a better convergence result. We also implement the gradient variant in full batch NAG and mini-batch NAG for training a logistic regression model on a large dataset in the encrypted domain. Equipped with this gradient variant, full batch NAG and mini-batch NAG are both faster than the original ones.
    Machine Learning for Food Review and Recommendation. (arXiv:2201.10978v1 [cs.IR])
    (2 min) Food reviews and recommendations have always been important for online food service websites. However, reviewing and recommending food is not simple as it is likely to be overwhelmed by disparate contexts and meanings. In this paper, we use different deep learning approaches to address the problems of sentiment analysis, automatic review tag generation, and retrieval of food reviews. We propose to develop a web-based food review system at Nanyang Technological University (NTU) named NTU Food Hunter, which incorporates different deep learning approaches that help users with food selection. First, we implement the BERT and LSTM deep learning models into the system for sentiment analysis of food reviews. Then, we develop a Part-of-Speech (POS) algorithm to automatically identify and extract adjective-noun pairs from the review content for review tag generation based on POS tagging and dependency parsing. Finally, we also train a RankNet model for the re-ranking of the retrieval results to improve the accuracy in our Solr-based food reviews search system. The experimental results show that our proposed deep learning approaches are promising for the applications of real-world problems.
    An Explainable Artificial Intelligence Framework for Quality-Aware IoE Service Delivery. (arXiv:2201.10822v1 [cs.AI])
    (0 min) One of the core envisions of the sixth-generation (6G) wireless networks is to accumulate artificial intelligence (AI) for autonomous controlling of the Internet of Everything (IoE). Particularly, the quality of IoE services delivery must be maintained by analyzing contextual metrics of IoE such as people, data, process, and things. However, the challenges incorporate when the AI model conceives a lake of interpretation and intuition to the network service provider. Therefore, this paper provides an explainable artificial intelligence (XAI) framework for quality-aware IoE service delivery that enables both intelligence and interpretation. First, a problem of quality-aware IoE service delivery is formulated by taking into account network dynamics and contextual metrics of IoE, where the objective is to maximize the channel quality index (CQI) of each IoE service user. Second, a regression problem is devised to solve the formulated problem, where explainable coefficients of the contextual matrices are estimated by Shapley value interpretation. Third, the XAI-enabled quality-aware IoE service delivery algorithm is implemented by employing ensemble-based regression models for ensuring the interpretation of contextual relationships among the matrices to reconfigure network parameters. Finally, the experiment results show that the uplink improvement rate becomes 42.43% and 16.32% for the AdaBoost and Extra Trees, respectively, while the downlink improvement rate reaches up to 28.57% and 14.29%. However, the AdaBoost-based approach cannot maintain the CQI of IoE service users. Therefore, the proposed Extra Trees-based regression model shows significant performance gain for mitigating the trade-off between accuracy and interpretability than other baselines.
    Open-World Semi-Supervised Learning. (arXiv:2102.03526v3 [cs.LG] UPDATED)
    (0 min) A fundamental limitation of applying semi-supervised learning in real-world settings is the assumption that unlabeled test data contains only classes previously encountered in the labeled training data. However, this assumption rarely holds for data in-the-wild, where instances belonging to novel classes may appear at testing time. Here, we introduce a novel open-world semi-supervised learning setting that formalizes the notion that novel classes may appear in the unlabeled test data. In this novel setting, the goal is to solve the class distribution mismatch between labeled and unlabeled data, where at the test time every input instance either needs to be classified into one of the existing classes or a new unseen class needs to be initialized. To tackle this challenging problem, we propose ORCA, an end-to-end deep learning approach that introduces uncertainty adaptive margin mechanism to circumvent the bias towards seen classes caused by learning discriminative features for seen classes faster than for the novel classes. In this way, ORCA reduces the gap between intra-class variance of seen with respect to novel classes. Experiments on image classification datasets and a single-cell annotation dataset demonstrate that ORCA consistently outperforms alternative baselines, achieving 25% improvement on seen and 96% improvement on novel classes of the ImageNet dataset.
    Generating 3D Molecules Conditional on Receptor Binding Sites with Deep Generative Models. (arXiv:2110.15200v2 [q-bio.QM] UPDATED)
    (0 min) The goal of structure-based drug discovery is to find small molecules that bind to a given target protein. Deep learning has been used to generate drug-like molecules with certain cheminformatic properties, but has not yet been applied to generating 3D molecules predicted to bind to proteins by sampling the conditional distribution of protein-ligand binding interactions. In this work, we describe for the first time a deep learning system for generating 3D molecular structures conditioned on a receptor binding site. We approach the problem using a conditional variational autoencoder trained on an atomic density grid representation of cross-docked protein-ligand structures. We apply atom fitting and bond inference procedures to construct valid molecular conformations from generated atomic densities. We evaluate the properties of the generated molecules and demonstrate that they change significantly when conditioned on mutated receptors. We also explore the latent space learned by our generative model using sampling and interpolation techniques. This work opens the door for end-to-end prediction of stable bioactive molecules from protein structures with deep learning.
    Wav2vec-Switch: Contrastive Learning from Original-noisy Speech Pairs for Robust Speech Recognition. (arXiv:2110.04934v2 [cs.CL] UPDATED)
    (0 min) The goal of self-supervised learning (SSL) for automatic speech recognition (ASR) is to learn good speech representations from a large amount of unlabeled speech for the downstream ASR task. However, most SSL frameworks do not consider noise robustness which is crucial for real-world applications. In this paper we propose wav2vec-Switch, a method to encode noise robustness into contextualized representations of speech via contrastive learning. Specifically, we feed original-noisy speech pairs simultaneously into the wav2vec 2.0 network. In addition to the existing contrastive learning task, we switch the quantized representations of the original and noisy speech as additional prediction targets of each other. By doing this, it enforces the network to have consistent predictions for the original and noisy speech, thus allows to learn contextualized representation with noise robustness. Our experiments on synthesized and real noisy data show the effectiveness of our method: it achieves 2.9--4.9% relative word error rate (WER) reduction on the synthesized noisy LibriSpeech data without deterioration on the original data, and 5.7% on CHiME-4 real 1-channel noisy data compared to a data augmentation baseline even with a strong language model for decoding. Our results on CHiME-4 can match or even surpass those with well-designed speech enhancement components.
    Physics-informed linear regression is competitive with two Machine Learning methods in residential building MPC. (arXiv:2110.15911v2 [cs.LG] UPDATED)
    (0 min) Because physics-based building models are difficult to obtain as each building is individual, there is an increasing interest in generating models suitable for building MPC directly from measurement data. Machine learning methods have been widely applied to this problem and validated mostly in simulation; there are, however, few studies on a direct comparison of different models or validation in real buildings to be found in the literature. Methods that are indeed validated in application often lead to computationally complex non-convex optimization problems. Here we compare physics-informed Autoregressive-Moving-Average with Exogenous Inputs (ARMAX) models to Machine Learning models based on Random Forests and Input Convex Neural Networks and the resulting convex MPC schemes in experiments on a practical building application with the goal of minimizing energy consumption while maintaining occupant comfort, and in a numerical case study. We demonstrate that Predictive Control in general leads to savings between 26% and 49% of heating and cooling energy, compared to the building's baseline hysteresis controller. Moreover, we show that all model types lead to satisfactory control performance in terms of constraint satisfaction and energy reduction. However, we also see that the physics-informed ARMAX models have a lower computational burden, and a superior sample efficiency compared to the Machine Learning based models. Moreover, even if abundant training data is available, the ARMAX models have a significantly lower prediction error than the Machine Learning models, which indicates that the encoded physics-based prior of the former cannot independently be found by the latter.
    Self-Learning Tuning for Post-Silicon Validation. (arXiv:2111.08995v3 [cs.LG] UPDATED)
    (0 min) Increasing complexity of modern chips makes design validation more difficult. Existing approaches are not able anymore to cope with the complexity of tasks such as robust performance tuning in post-silicon validation. Therefore, we propose a novel approach based on learn-to-optimize and reinforcement learning in order to solve complex and mixed-type tuning tasks in a efficient and robust way.
    LightSAFT: Lightweight Latent Source Aware Frequency Transform for Source Separation. (arXiv:2111.12516v2 [eess.AS] UPDATED)
    (0 min) Conditioned source separations have attracted significant attention because of their flexibility, applicability and extensionality. Their performance was usually inferior to the existing approaches, such as the single source separation model. However, a recently proposed method called LaSAFT-Net has shown that conditioned models can show comparable performance against existing single-source separation models. This paper presents LightSAFT-Net, a lightweight version of LaSAFT-Net. As a baseline, it provided a sufficient SDR performance for comparison during the Music Demixing Challenge at ISMIR 2021. This paper also enhances the existing LightSAFT-Net by replacing the LightSAFT blocks in the encoder with TFC-TDF blocks. Our enhanced LightSAFT-Net outperforms the previous one with fewer parameters.Conditioned source separations have attracted significant attention because of their flexibility, applicability and extensionality. Their performance was usually inferior to the existing approaches, such as the single source separation model. However, a recently proposed method called LaSAFT-Net has shown that conditioned models can show comparable performance against existing single-source separation models. This paper presents LightSAFT-Net, a lightweight version of LaSAFT-Net. As a baseline, it provided a sufficient SDR performance for comparison during the Music Demixing Challenge at ISMIR 2021.
    Equivariant Quantum Graph Circuits. (arXiv:2112.05261v2 [cs.LG] UPDATED)
    (0 min) We investigate quantum circuits for graph representation learning, and propose equivariant quantum graph circuits (EQGCs), as a class of parameterized quantum circuits with strong relational inductive bias for learning over graph-structured data. Conceptually, EQGCs serve as a unifying framework for quantum graph representation learning, allowing us to define several interesting subclasses subsuming existing proposals. In terms of the representation power, we prove that the subclasses of interest are universal approximators for functions over the bounded graph domain, and provide experimental evidence. Our theoretical perspective on quantum graph machine learning methods opens many directions for further work, and could lead to models with capabilities beyond those of classical approaches.
    A Unified Strategy for Multilingual Grammatical Error Correction with Pre-trained Cross-Lingual Language Model. (arXiv:2201.10707v1 [cs.CL])
    (0 min) Synthetic data construction of Grammatical Error Correction (GEC) for non-English languages relies heavily on human-designed and language-specific rules, which produce limited error-corrected patterns. In this paper, we propose a generic and language-independent strategy for multilingual GEC, which can train a GEC system effectively for a new non-English language with only two easy-to-access resources: 1) a pretrained cross-lingual language model (PXLM) and 2) parallel translation data between English and the language. Our approach creates diverse parallel GEC data without any language-specific operations by taking the non-autoregressive translation generated by PXLM and the gold translation as error-corrected sentence pairs. Then, we reuse PXLM to initialize the GEC model and pretrain it with the synthetic data generated by itself, which yields further improvement. We evaluate our approach on three public benchmarks of GEC in different languages. It achieves the state-of-the-art results on the NLPCC 2018 Task 2 dataset (Chinese) and obtains competitive performance on Falko-Merlin (German) and RULEC-GEC (Russian). Further analysis demonstrates that our data construction method is complementary to rule-based approaches.
    Explaining Hyperparameter Optimization via Partial Dependence Plots. (arXiv:2111.04820v2 [cs.LG] UPDATED)
    (0 min) Automated hyperparameter optimization (HPO) can support practitioners to obtain peak performance in machine learning models. However, there is often a lack of valuable insights into the effects of different hyperparameters on the final model performance. This lack of explainability makes it difficult to trust and understand the automated HPO process and its results. We suggest using interpretable machine learning (IML) to gain insights from the experimental data obtained during HPO with Bayesian optimization (BO). BO tends to focus on promising regions with potential high-performance configurations and thus induces a sampling bias. Hence, many IML techniques, such as the partial dependence plot (PDP), carry the risk of generating biased interpretations. By leveraging the posterior uncertainty of the BO surrogate model, we introduce a variant of the PDP with estimated confidence bands. We propose to partition the hyperparameter space to obtain more confident and reliable PDPs in relevant sub-regions. In an experimental study, we provide quantitative evidence for the increased quality of the PDPs within sub-regions.
    KML: Using Machine Learning to Improve Storage Systems. (arXiv:2111.11554v2 [cs.OS] UPDATED)
    (0 min) Operating systems include many heuristic algorithms designed to improve overall storage performance and throughput. Because such heuristics cannot work well for all conditions and workloads, system designers resorted to exposing numerous tunable parameters to users -- thus burdening users with continually optimizing their own storage systems and applications. Storage systems are usually responsible for most latency in I/O-heavy applications, so even a small latency improvement can be significant. Machine learning (ML) techniques promise to learn patterns, generalize from them, and enable optimal solutions that adapt to changing workloads. We propose that ML solutions become a first-class component in OSs and replace manual heuristics to optimize storage systems dynamically. In this paper, we describe our proposed ML architecture, called KML. We developed a prototype KML architecture and applied it to two case studies: optimizing readahead and NFS read-size values. Our experiments show that KML consumes less than 4KB of dynamic kernel memory, has a CPU overhead smaller than 0.2%, and yet can learn patterns and improve I/O throughput by as much as 2.3x and 15x for two case studies -- even for complex, never-seen-before, concurrently running mixed workloads on different storage devices.
    Generalization Error Bounds on Deep Learning with Markov Datasets. (arXiv:2201.11059v1 [stat.ML])
    (0 min) In this paper, we derive upper bounds on generalization errors for deep neural networks with Markov datasets. These bounds are developed based on Koltchinskii and Panchenko's approach for bounding the generalization error of combined classifiers with i.i.d. datasets. The development of new symmetrization inequalities in high-dimensional probability for Markov chains is a key element in our extension, where the pseudo-spectral gap of the infinitesimal generator of the Markov chain plays as a key parameter in these inequalities. We also propose a simple method to convert these bounds and other similar bounds on traditional deep learning and machine learning to Bayesian counterparts for both i.i.d. and Markov datasets.
    ReLU Regression with Massart Noise. (arXiv:2109.04623v2 [cs.LG] UPDATED)
    (0 min) We study the fundamental problem of ReLU regression, where the goal is to fit Rectified Linear Units (ReLUs) to data. This supervised learning task is efficiently solvable in the realizable setting, but is known to be computationally hard with adversarial label noise. In this work, we focus on ReLU regression in the Massart noise model, a natural and well-studied semi-random noise model. In this model, the label of every point is generated according to a function in the class, but an adversary is allowed to change this value arbitrarily with some probability, which is {\em at most} $\eta < 1/2$. We develop an efficient algorithm that achieves exact parameter recovery in this model under mild anti-concentration assumptions on the underlying distribution. Such assumptions are necessary for exact recovery to be information-theoretically possible. We demonstrate that our algorithm significantly outperforms naive applications of $\ell_1$ and $\ell_2$ regression on both synthetic and real data.
    Towards General Deep Leakage in Federated Learning. (arXiv:2110.09074v2 [cs.LG] UPDATED)
    (0 min) Unlike traditional central training, federated learning (FL) improves the performance of the global model by sharing and aggregating local models rather than local data to protect the users' privacy. Although this training approach appears secure, some research has demonstrated that an attacker can still recover private data based on the shared gradient information. This on-the-fly reconstruction attack deserves to be studied in depth because it can occur at any stage of training, whether at the beginning or at the end of model training; no relevant dataset is required and no additional models need to be trained. We break through some unrealistic assumptions and limitations to apply this reconstruction attack in a broader range of scenarios. We propose methods that can reconstruct the training data from shared gradients or weights, corresponding to the FedSGD and FedAvg usage scenarios, respectively. We propose a zero-shot approach to restore labels even if there are duplicate labels in the batch. We study the relationship between the label and image restoration. We find that image restoration fails even if there is only one incorrectly inferred label in the batch; we also find that when batch images have the same label, the corresponding image is restored as a fusion of that class of images. Our approaches are evaluated on classic image benchmarks, including CIFAR-10 and ImageNet. The batch size, image quality, and the adaptability of the label distribution of our approach exceed those of GradInversion, the state-of-the-art.
    A SOM-based Gradient-Free Deep Learning Method with Convergence Analysis. (arXiv:2101.05612v2 [cs.LG] UPDATED)
    (0 min) As gradient descent method in deep learning causes a series of questions, this paper proposes a novel gradient-free deep learning structure. By adding a new module into traditional Self-Organizing Map and introducing residual into the map, a Deep Valued Self-Organizing Map network is constructed. And analysis about the convergence performance of such a deep Valued Self-Organizing Map network is proved in this paper, which gives an inequality about the designed parameters with the dimension of inputs and the loss of prediction.
    Variational multiple shooting for Bayesian ODEs with Gaussian processes. (arXiv:2106.10905v2 [cs.LG] UPDATED)
    (0 min) Recent machine learning advances have proposed black-box estimation of unknown continuous-time system dynamics directly from data. However, earlier works are based on approximative ODE solutions or point estimates. We propose a novel Bayesian nonparametric model that uses Gaussian processes to infer posteriors of unknown ODE systems directly from data. We derive sparse variational inference with decoupled functional sampling to represent vector field posteriors. We also introduce a probabilistic shooting augmentation to enable efficient inference from arbitrarily long trajectories. The method demonstrates the benefit of computing vector field posteriors, with predictive uncertainty scores outperforming alternative methods on multiple ODE learning tasks.
    DONE: Distributed Approximate Newton-type Method for Federated Edge Learning. (arXiv:2012.05625v4 [cs.LG] UPDATED)
    (0 min) There is growing interest in applying distributed machine learning to edge computing, forming federated edge learning. Federated edge learning faces non-i.i.d. and heterogeneous data, and the communication between edge workers, possibly through distant locations and with unstable wireless networks, is more costly than their local computational overhead. In this work, we propose DONE, a distributed approximate Newton-type algorithm with fast convergence rate for communication-efficient federated edge learning. First, with strongly convex and smooth loss functions, DONE approximates the Newton direction in a distributed manner using the classical Richardson iteration on each edge worker. Second, we prove that DONE has linear-quadratic convergence and analyze its communication complexities. Finally, the experimental results with non-i.i.d. and heterogeneous data show that DONE attains a comparable performance to the Newton's method. Notably, DONE requires fewer communication iterations compared to distributed gradient descent and outperforms DANE and FEDL, state-of-the-art approaches, in the case of non-quadratic loss functions.
    CNN-Based Image Reconstruction Method for Ultrafast Ultrasound Imaging. (arXiv:2008.12750v2 [eess.IV] UPDATED)
    (0 min) Ultrafast ultrasound (US) revolutionized biomedical imaging with its capability of acquiring full-view frames at over 1 kHz, unlocking breakthrough modalities such as shear-wave elastography and functional US neuroimaging. Yet, it suffers from strong diffraction artifacts, mainly caused by grating lobes, side lobes, or edge waves. Multiple acquisitions are typically required to obtain a sufficient image quality, at the cost of a reduced frame rate. To answer the increasing demand for high-quality imaging from single unfocused acquisitions, we propose a two-step convolutional neural network (CNN)-based image reconstruction method, compatible with real-time imaging. A low-quality estimate is obtained by means of a backprojection-based operation, akin to conventional delay-and-sum beamforming, from which a high-quality image is restored using a residual CNN with multi-scale and multi-channel filtering properties, trained specifically to remove the diffraction artifacts inherent to ultrafast US imaging. To account for both the high dynamic range and the oscillating properties of radio frequency US images, we introduce the mean signed logarithmic absolute error (MSLAE) as training loss function. Experiments were conducted with a linear transducer array, in single plane wave (PW) imaging. Trainings were performed on a simulated dataset, crafted to contain a wide diversity of structures and echogenicities. Extensive numerical evaluations demonstrate that the proposed approach can reconstruct images from single PWs with a quality similar to that of gold-standard synthetic aperture imaging, on a dynamic range in excess of 60 dB. In vitro and in vivo experiments show that trainings carried out on simulated data perform well in experimental settings.
    Momentum Capsule Networks. (arXiv:2201.11091v1 [cs.CV])
    (0 min) Capsule networks are a class of neural networks that achieved promising results on many computer vision tasks. However, baseline capsule networks have failed to reach state-of-the-art results on more complex datasets due to the high computation and memory requirements. We tackle this problem by proposing a new network architecture, called Momentum Capsule Network (MoCapsNet). MoCapsNets are inspired by Momentum ResNets, a type of network that applies reversible residual building blocks. Reversible networks allow for recalculating activations of the forward pass in the backpropagation algorithm, so those memory requirements can be drastically reduced. In this paper, we provide a framework on how invertible residual building blocks can be applied to capsule networks. We will show that MoCapsNet beats the accuracy of baseline capsule networks on MNIST, SVHN and CIFAR-10 while using considerably less memory. The source code is available on https://github.com/moejoe95/MoCapsNet.
    Promises and Challenges of Causality for Ethical Machine Learning. (arXiv:2201.10683v1 [cs.LG])
    (0 min) In recent years, there has been increasing interest in causal reasoning for designing fair decision-making systems due to its compatibility with legal frameworks, interpretability for human stakeholders, and robustness to spurious correlations inherent in observational data, among other factors. The recent attention to causal fairness, however, has been accompanied with great skepticism due to practical and epistemological challenges with applying current causal fairness approaches in the literature. Motivated by the long-standing empirical work on causality in econometrics, social sciences, and biomedical sciences, in this paper we lay out the conditions for appropriate application of causal fairness under the "potential outcomes framework." We highlight key aspects of causal inference that are often ignored in the causal fairness literature. In particular, we discuss the importance of specifying the nature and timing of interventions on social categories such as race or gender. Precisely, instead of postulating an intervention on immutable attributes, we propose a shift in focus to their perceptions and discuss the implications for fairness evaluation. We argue that such conceptualization of the intervention is key in evaluating the validity of causal assumptions and conducting sound causal analysis including avoiding post-treatment bias. Subsequently, we illustrate how causality can address the limitations of existing fairness metrics, including those that depend upon statistical correlations. Specifically, we introduce causal variants of common statistical notions of fairness, and we make a novel observation that under the causal framework there is no fundamental disagreement between different notions of fairness. Finally, we conduct extensive experiments where we demonstrate our approach for evaluating and mitigating unfairness, specially when post-treatment variables are present.
    A Deeper Look at Discounting Mismatch in Actor-Critic Algorithms. (arXiv:2010.01069v4 [cs.LG] UPDATED)
    (0 min) We investigate the discounting mismatch in actor-critic algorithm implementations from a representation learning perspective. Theoretically, actor-critic algorithms usually have discounting for both actor and critic, i.e., there is a $\gamma^t$ term in the actor update for the transition observed at time $t$ in a trajectory and the critic is a discounted value function. Practitioners, however, usually ignore the discounting ($\gamma^t$) for the actor while using a discounted critic. We investigate this mismatch in two scenarios. In the first scenario, we consider optimizing an undiscounted objective $(\gamma = 1)$ where $\gamma^t$ disappears naturally $(1^t = 1)$. We then propose to interpret the discounting in critic in terms of a bias-variance-representation trade-off and provide supporting empirical results. In the second scenario, we consider optimizing a discounted objective ($\gamma < 1$) and propose to interpret the omission of the discounting in the actor update from an auxiliary task perspective and provide supporting empirical results.
    Machine Learning Characterization of Cancer Patients-Derived Extracellular Vesicles using Vibrational Spectroscopies. (arXiv:2107.10332v7 [q-bio.OT] UPDATED)
    (0 min) The early detection of cancer is a challenging problem in medicine. The blood sera of cancer patients are enriched with heterogeneous secretory lipid bound extracellular vesicles (EVs), which present a complex repertoire of information and biomarkers, representing their cell of origin, that are being currently studied in the field of liquid biopsy and cancer screening. Vibrational spectroscopies provide non-invasive approaches for the assessment of structural and biophysical properties in complex biological samples. In this pilot study, multiple Raman spectroscopy measurements were performed on the EVs extracted from the blood sera of 9 patients consisting of four different cancer subtypes (colorectal cancer, hepatocellular carcinoma, breast cancer and pancreatic cancer) and five healthy patients (controls). FTIR (Fourier Transform Infrared) spectroscopy measurements were performed as a complementary approach to Raman analysis, on two of the four cancer subtypes. The AdaBoost Random Forest Classifier, Decision Trees, and Support Vector Machines (SVM) distinguished the baseline corrected Raman spectra of cancer EVs from those of healthy controls (18 spectra) with a classification accuracy of above 90 percent when reduced to a spectral frequency range of 1800 to 1940 inverse cm and subjected to a 50:50 training: testing split. FTIR classification accuracy on 14 spectra showed an 80 percent classification accuracy. Our findings demonstrate that basic machine learning algorithms are powerful applied intelligence tools to distinguish the complex vibrational spectra of cancer patient EVs from those of healthy patients. These experimental methods hold promise as valid and efficient liquid biopsy for artificial intelligence-assisted early cancer screening.
    Splintering with distributions: A stochastic decoy scheme for private computation. (arXiv:2007.02719v3 [cs.LG] UPDATED)
    (0 min) Performing computations while maintaining privacy is an important problem in todays distributed machine learning solutions. Consider the following two set ups between a client and a server, where in setup i) the client has a public data vector $\mathbf{x}$, the server has a large private database of data vectors $\mathcal{B}$ and the client wants to find the inner products $\langle \mathbf{x,y_k} \rangle, \forall \mathbf{y_k} \in \mathcal{B}$. The client does not want the server to learn $\mathbf{x}$ while the server does not want the client to learn the records in its database. This is in contrast to another setup ii) where the client would like to perform an operation solely on its data, such as computation of a matrix inverse on its data matrix $\mathbf{M}$, but would like to use the superior computing ability of the server to do so without having to leak $\mathbf{M}$ to the server. \par We present a stochastic scheme for splitting the client data into privatized shares that are transmitted to the server in such settings. The server performs the requested operations on these shares instead of on the raw client data at the server. The obtained intermediate results are sent back to the client where they are assembled by the client to obtain the final result.
    Online POI Recommendation: Learning Dynamic Geo-Human Interactions in Streams. (arXiv:2201.10983v1 [cs.IR])
    (0 min) In this paper, we focus on the problem of modeling dynamic geo-human interactions in streams for online POI recommendations. Specifically, we formulate the in-stream geo-human interaction modeling problem into a novel deep interactive reinforcement learning framework, where an agent is a recommender and an action is a next POI to visit. We uniquely model the reinforcement learning environment as a joint and connected composition of users and geospatial contexts (POIs, POI categories, functional zones). An event that a user visits a POI in stream updates the states of both users and geospatial contexts; the agent perceives the updated environment state to make online recommendations. Specifically, we model a mixed-user event stream by unifying all users, visits, and geospatial contexts as a dynamic knowledge graph stream, in order to model human-human, geo-human, geo-geo interactions. We design an exit mechanism to address the expired information challenge, devise a meta-path method to address the recommendation candidate generation challenge, and develop a new deep policy network structure to address the varying action space challenge, and, finally, propose an effective adversarial training method for optimization. Finally, we present extensive experiments to demonstrate the enhanced performance of our method.
    Discovering Boundary Values of Feature-based Machine Learning Classifiers through Exploratory Datamorphic Testing. (arXiv:2110.00330v2 [cs.LG] UPDATED)
    (0 min) Testing has been widely recognised as difficult for AI applications. This paper proposes a set of testing strategies for testing machine learning applications in the framework of the datamorphism testing methodology. In these strategies, testing aims at exploring the data space of a classification or clustering application to discover the boundaries between classes that the machine learning application defines. This enables the tester to understand precisely the behaviour and function of the software under test. In the paper, three variants of exploratory strategies are presented with the algorithms implemented in the automated datamorphic testing tool Morphy. The correctness of these algorithms are formally proved. Their capability and cost of discovering borders between classes are evaluated via a set of controlled experiments with manually designed subjects and a set of case studies with real machine learning models.
    Learning Sparse Masks for Diffusion-based Image Inpainting. (arXiv:2110.02636v2 [eess.IV] UPDATED)
    (0 min) Diffusion-based inpainting is a powerful tool for the reconstruction of images from sparse data. Its quality strongly depends on the choice of known data. Optimising their spatial location -- the inpainting mask -- is challenging. A commonly used tool for this task are stochastic optimisation strategies. However, they are slow as they compute multiple inpainting results. We provide a remedy in terms of a learned mask generation model. By emulating the complete inpainting pipeline with two networks for mask generation and neural surrogate inpainting, we obtain a model for highly efficient adaptive mask generation. Experiments indicate that our model can achieve competitive quality with an acceleration by as much as four orders of magnitude. Our findings serve as a basis for making diffusion-based inpainting more attractive for applications such as image compression, where fast encoding is highly desirable.
    Federated Learning for Internet of Things: Applications, Challenges, and Opportunities. (arXiv:2111.07494v2 [cs.LG] UPDATED)
    (0 min) Billions of IoT devices will be deployed in the near future, taking advantage of faster Internet speed and the possibility of orders of magnitude more endpoints brought by 5G/6G. With the growth of IoT devices, vast quantities of data that may contain users' private information will be generated. The high communication and storage costs, mixed with privacy concerns, will increasingly challenge the traditional ecosystem of centralized over-the-cloud learning and processing for IoT platforms. Federated Learning (FL) has emerged as the most promising alternative approach to this problem. In FL, training data-driven machine learning models is an act of collaboration between multiple clients without requiring the data to be brought to a central point, hence alleviating communication and storage costs and providing a great degree of user-level privacy. However, there are still some challenges existing in the real FL system implementation on IoT networks. In this paper, we will discuss the opportunities and challenges of FL in IoT platforms, as well as how it can enable diverse IoT applications. In particular, we identify and discuss seven critical challenges of FL in IoT platforms and highlight some recent promising approaches towards addressing them.
    Complexity of zigzag sampling algorithm for strongly log-concave distributions. (arXiv:2012.11094v2 [stat.ML] UPDATED)
    (0 min) We study the computational complexity of zigzag sampling algorithm for strongly log-concave distributions. The zigzag process has the advantage of not requiring time discretization for implementation, and that each proposed bouncing event requires only one evaluation of partial derivative of the potential, while its convergence rate is dimension independent. Using these properties, we prove that the zigzag sampling algorithm achieves $\varepsilon$ error in chi-square divergence with a computational cost equivalent to $O\bigl(\kappa^2 d^\frac{1}{2}(\log\frac{1}{\varepsilon})^{\frac{3}{2}}\bigr)$ gradient evaluations in the regime $\kappa \ll \frac{d}{\log d}$ under a warm start assumption, where $\kappa$ is the condition number and $d$ is the dimension.
    Graph Neural Networks with Dynamic and Static Representations for Social Recommendation. (arXiv:2201.10751v1 [cs.IR])
    (0 min) Recommender systems based on graph neural networks receive increasing research interest due to their excellent ability to learn a variety of side information including social networks. However, previous works usually focus on modeling users, not much attention is paid to items. Moreover, the possible changes in the attraction of items over time, which is like the dynamic interest of users are rarely considered, and neither do the correlations among items. To overcome these limitations, this paper proposes graph neural networks with dynamic and static representations for social recommendation (GNN-DSR), which considers both dynamic and static representations of users and items and incorporates their relational influence. GNN-DSR models the short-term dynamic and long-term static interactional representations of the user's interest and the item's attraction, respectively. Furthermore, the attention mechanism is used to aggregate the social influence of users on the target user and the correlative items' influence on a given item. The final latent factors of user and item are combined to make a prediction. Experiments on three real-world recommender system datasets validate the effectiveness of GNN-DSR.
    BRIDGE: Byzantine-resilient Decentralized Gradient Descent. (arXiv:1908.08098v2 [stat.ML] UPDATED)
    (0 min) Machine learning has begun to play a central role in many applications. A multitude of these applications typically also involve datasets that are distributed across multiple computing devices/machines due to either design constraints (e.g., multiagent systems) or computational/privacy reasons (e.g., learning on smartphone data). Such applications often require the learning tasks to be carried out in a decentralized fashion, in which there is no central server that is directly connected to all nodes. In real-world decentralized settings, nodes are prone to undetected failures due to malfunctioning equipment, cyberattacks, etc., which are likely to crash non-robust learning algorithms. The focus of this paper is on robustification of decentralized learning in the presence of nodes that have undergone Byzantine failures. The Byzantine failure model allows faulty nodes to arbitrarily deviate from their intended behaviors, thereby ensuring designs of the most robust of algorithms. But the study of Byzantine resilience within decentralized learning, in contrast to distributed learning, is still in its infancy. In particular, existing Byzantine-resilient decentralized learning methods either do not scale well to large-scale machine learning models, or they lack statistical convergence guarantees that help characterize their generalization errors. In this paper, a scalable, Byzantine-resilient decentralized machine learning framework termed Byzantine-resilient decentralized gradient descent (BRIDGE) is introduced. Algorithmic and statistical convergence guarantees for one variant of BRIDGE are also provided in the paper for both strongly convex problems and a class of nonconvex problems. In addition, large-scale decentralized learning experiments are used to establish that the BRIDGE framework is scalable and it delivers competitive results for Byzantine-resilient convex and nonconvex learning.
    FIGARO: Generating Symbolic Music with Fine-Grained Artistic Control. (arXiv:2201.10936v1 [cs.SD])
    (0 min) Generating music with deep neural networks has been an area of active research in recent years. While the quality of generated samples has been steadily increasing, most methods are only able to exert minimal control over the generated sequence, if any. We propose the self-supervised \emph{description-to-sequence} task, which allows for fine-grained controllable generation on a global level by extracting high-level features about the target sequence and learning the conditional distribution of sequences given the corresponding high-level description in a sequence-to-sequence modelling setup. We train FIGARO (FIne-grained music Generation via Attention-based, RObust control) by applying \emph{description-to-sequence} modelling to symbolic music. By combining learned high level features with domain knowledge, which acts as a strong inductive bias, the model achieves state-of-the-art results in controllable symbolic music generation and generalizes well beyond the training distribution.
    Joint Liver and Hepatic Lesion Segmentation using a Hybrid CNN with Transformer Layers. (arXiv:2201.10981v1 [eess.IV])
    (0 min) Deep learning-based segmentation of the liver and hepatic lesions therein steadily gains relevance in clinical practice due to the increasing incidence of liver cancer each year. Whereas various network variants with overall promising results in the field of medical image segmentation have been developed over the last years, almost all of them struggle with the challenge of accurately segmenting hepatic lesions. This lead to the idea of combining elements of convolutional and transformerbased architectures to overcome the existing limitations. This work presents a hybrid network called SWTR-Unet, consisting of a pretrained ResNet, transformer blocks as well as a common Unet-style decoder path. This network was applied to clinical liver MRI, as well as to the publicly available CT data of the liver tumor segmentation (LiTS) challenge. Additionally, multiple state-of-the-art networks were implemented and applied to both datasets, ensuring a direct comparability. Furthermore, correlation analysis and an ablation study were carried out, to investigate various influencing factors on the segmentation accuracy of our presented method. With Dice similarity scores of averaged 98 +- 2 % for liver and 81 +- 28 % lesion segmentation on the MRI dataset and 97 +- 2 % and 79 +- 25 %, respectively on the CT dataset, the proposed SWTR-Unet outperforms each of the additionally implemented state-of-the-art networks. The achieved segmentation accuracy was found to be on par with manually performed expert segmentations as indicated by interobserver variabilities for liver lesion segmentation. In conclusion, the presented method could save valuable time and resources in clinical practice.
    Evaluating language-biased image classification based on semantic representations. (arXiv:2201.11014v1 [cs.CV])
    (0 min) Humans show language-biased image recognition for a word-embedded image, known as picture-word interference. Such interference depends on hierarchical semantic categories and reflects that human language processing highly interacts with visual processing. Similar to humans, recent artificial models jointly trained on texts and images, e.g., OpenAI CLIP, show language-biased image classification. Exploring whether the bias leads to interferences similar to those observed in humans can contribute to understanding how much the model acquires hierarchical semantic representations from joint learning of language and vision. The present study introduces methodological tools from the cognitive science literature to assess the biases of artificial models. Specifically, we introduce a benchmark task to test whether words superimposed on images can distort the image classification across different category levels and, if it can, whether the perturbation is due to the shared semantic representation between language and vision. Our dataset is a set of word-embedded images and consists of a mixture of natural image datasets and hierarchical word labels with superordinate/basic category levels. Using this benchmark test, we evaluate the CLIP model. We show that presenting words distorts the image classification by the model across different category levels, but the effect does not depend on the semantic relationship between images and embedded words. This suggests that the semantic word representation in the CLIP visual processing is not shared with the image representation, although the word representation strongly dominates for word-embedded images.
    Do Neural Networks for Segmentation Understand Insideness?. (arXiv:2201.10664v1 [cs.CV])
    (0 min) The insideness problem is an aspect of image segmentation that consists of determining which pixels are inside and outside a region. Deep Neural Networks (DNNs) excel in segmentation benchmarks, but it is unclear if they have the ability to solve the insideness problem as it requires evaluating long-range spatial dependencies. In this paper, the insideness problem is analysed in isolation, without texture or semantic cues, such that other aspects of segmentation do not interfere in the analysis. We demonstrate that DNNs for segmentation with few units have sufficient complexity to solve insideness for any curve. Yet, such DNNs have severe problems with learning general solutions. Only recurrent networks trained with small images learn solutions that generalize well to almost any curve. Recurrent networks can decompose the evaluation of long-range dependencies into a sequence of local operations, and learning with small images alleviates the common difficulties of training recurrent networks with a large number of unrolling steps.
    Model-free and Bayesian Ensembling Model-based Deep Reinforcement Learning for Particle Accelerator Control Demonstrated on the FERMI FEL. (arXiv:2012.09737v2 [cs.LG] UPDATED)
    (0 min) Reinforcement learning holds tremendous promise in accelerator controls. The primary goal of this paper is to show how this approach can be utilised on an operational level on accelerator physics problems. Despite the success of model-free reinforcement learning in several domains, sample-efficiency still is a bottle-neck, which might be encompassed by model-based methods. We compare well-suited purely model-based to model-free reinforcement learning applied to the intensity optimisation on the FERMI FEL system. We find that the model-based approach demonstrates higher representational power and sample-efficiency, while the asymptotic performance of the model-free method is slightly superior. The model-based algorithm is implemented in a DYNA-style using an uncertainty aware model, and the model-free algorithm is based on tailored deep Q-learning. In both cases, the algorithms were implemented in a way, which presents increased noise robustness as omnipresent in accelerator control problems. Code is released in https://github.com/MathPhysSim/FERMI_RL_Paper.
    Information-Theoretic Characterization of the Generalization Error for Iterative Semi-Supervised Learning. (arXiv:2110.00926v2 [cs.LG] UPDATED)
    (0 min) Using information-theoretic principles, we consider the generalization error (gen-error) of iterative semi-supervised learning (SSL) algorithms that iteratively generate pseudo-labels for a large amount of unlabelled data to progressively refine the model parameters. In contrast to most previous works that {\em bound} the gen-error, we provide an {\em exact} expression for the gen-error and particularize it to the binary Gaussian mixture model. Our theoretical results suggest that when the class conditional variances are not too large, the gen-error decreases with the number of iterations, but quickly saturates. On the flip side, if the class conditional variances (and so amount of overlap between the classes) are large, the gen-error increases with the number of iterations. To mitigate this undesirable effect, we show that regularization can reduce the gen-error. The theoretical results are corroborated by extensive experiments on the MNIST and CIFAR datasets in which we notice that for easy-to-distinguish classes, the gen-error improves after several pseudo-labelling iterations, but saturates afterwards, and for more difficult-to-distinguish classes, regularization improves the generalization performance.
    Fast Server Learning Rate Tuning for Coded Federated Dropout. (arXiv:2201.11036v1 [cs.LG])
    (0 min) In cross-device Federated Learning (FL), clients with low computational power train a common machine model by exchanging parameters updates instead of potentially private data. Federated Dropout (FD) is a technique that improves the communication efficiency of a FL session by selecting a subset of model variables to be updated in each training round. However, FD produces considerably lower accuracy and higher convergence time compared to standard FL. In this paper, we leverage coding theory to enhance FD by allowing a different sub-model to be used at each client. We also show that by carefully tuning the server learning rate hyper-parameter, we can achieve higher training speed and up to the same final accuracy of the no dropout case. For the EMNIST dataset, our mechanism achieves 99.6 % of the final accuracy of the no dropout case while requiring 2.43x less bandwidth to achieve this accuracy level.
    Dynamic Data Augmentation with Gating Networks. (arXiv:2111.03253v2 [cs.LG] UPDATED)
    (0 min) Data augmentation is a technique to improve the generalization ability of machine learning methods by increasing the size of the dataset. However, since every augmentation method is not equally effective for every dataset, you need to select an appropriate method carefully. We propose a neural network that dynamically selects the best combination of data augmentation methods using a mutually beneficial gating network and a feature consistency loss. The gating network is able to control how much of each data augmentation is used for the representation within the network. The feature consistency loss gives a constraint that augmented features from the same input should be in similar. In experiments, we demonstrate the effectiveness of the proposed method on the 12 largest time-series datasets from 2018 UCR Time Series Archive and reveal the relationships between the data augmentation methods through analysis of the proposed method.
    Bayesian Learning via Neural Schr\"odinger-F\"ollmer Flows. (arXiv:2111.10510v6 [stat.ML] UPDATED)
    (0 min) In this work we explore a new framework for approximate Bayesian inference in large datasets based on stochastic control. We advocate stochastic control as a finite time and low variance alternative to popular steady-state methods such as stochastic gradient Langevin dynamics (SGLD). Furthermore, we discuss and adapt the existing theoretical guarantees of this framework and establish connections to already existing VI routines in SDE-based models.
    Graph Neural Network for Video Relocalization. (arXiv:2007.09877v2 [cs.CV] UPDATED)
    (0 min) In this paper, we focus on video relocalization task, which uses a query video clip as input to retrieve a semantic relative video clip in another untrimmed long video. we find that in video relocalization datasets, there exists a phenomenon showing that there does not exist consistent relationship between feature similarity by frame and feature similarity by video, which affects the feature fusion among frames. However, existing video relocalization methods do not fully consider it. Taking this phenomenon into account, in this article, we treat video features as a graph by concatenating the query video feature and proposal video feature along time dimension, where each timestep is treated as a node, each row of the feature matrix is treated as feature of each node. Then, with the power of graph neural networks, we propose a Multi-Graph Feature Fusion Module to fuse the relation feature of this graph. After evaluating our method on ActivityNet v1.2 dataset and Thumos14 dataset, we find that our proposed method outperforms the state of art methods.
    SleepTransformer: Automatic Sleep Staging with Interpretability and Uncertainty Quantification. (arXiv:2105.11043v3 [cs.LG] UPDATED)
    (0 min) Background: Black-box skepticism is one of the main hindrances impeding deep-learning-based automatic sleep scoring from being used in clinical environments. Methods: Towards interpretability, this work proposes a sequence-to-sequence sleep-staging model, namely SleepTransformer. It is based on the transformer backbone and offers interpretability of the model's decisions at both the epoch and sequence level. We further propose a simple yet efficient method to quantify uncertainty in the model's decisions. The method, which is based on entropy, can serve as a metric for deferring low-confidence epochs to a human expert for further inspection. Results: Making sense of the transformer's self-attention scores for interpretability, at the epoch level, the attention scores are encoded as a heat map to highlight sleep-relevant features captured from the input EEG signal. At the sequence level, the attention scores are visualized as the influence of different neighboring epochs in an input sequence (i.e. the context) to recognition of a target epoch, mimicking the way manual scoring is done by human experts. Conclusion: Additionally, we demonstrate that SleepTransformer performs on par with existing methods on two databases of different sizes. Significance: Equipped with interpretability and the ability of uncertainty quantification, SleepTransformer holds promise for being integrated into clinical settings.
    Uphill Roads to Variational Tightness: Monotonicity and Monte Carlo Objectives. (arXiv:2201.10989v1 [stat.ML])
    (0 min) We revisit the theory of importance weighted variational inference (IWVI), a promising strategy for learning latent variable models. IWVI uses new variational bounds, known as Monte Carlo objectives (MCOs), obtained by replacing intractable integrals by Monte Carlo estimates -- usually simply obtained via importance sampling. Burda, Grosse and Salakhutdinov (2016) showed that increasing the number of importance samples provably tightens the gap between the bound and the likelihood. Inspired by this simple monotonicity theorem, we present a series of nonasymptotic results that link properties of Monte Carlo estimates to tightness of MCOs. We challenge the rationale that smaller Monte Carlo variance leads to better bounds. We confirm theoretically the empirical findings of several recent papers by showing that, in a precise sense, negative correlation reduces the variational gap. We also generalise the original monotonicity theorem by considering non-uniform weights. We discuss several practical consequences of our theoretical results. Our work borrows many ideas and results from the theory of stochastic orders.
    Explicit construction of the minimum error variance estimator for stochastic LTI state-space systems. (arXiv:2109.02384v2 [math.OC] UPDATED)
    (0 min) In this short article, we showcase the derivation of the optimal (minimum error variance) estimator, when one part of the stochastic LTI system output is not measured but is able to be predicted from the measured system outputs. Similar derivations have been done before but not using state-space representation.
    Meta-learning Spiking Neural Networks with Surrogate Gradient Descent. (arXiv:2201.10777v1 [cs.NE])
    (0 min) Adaptive "life-long" learning at the edge and during online task performance is an aspirational goal of AI research. Neuromorphic hardware implementing Spiking Neural Networks (SNNs) are particularly attractive in this regard, as their real-time, event-based, local computing paradigm makes them suitable for edge implementations and fast learning. However, the long and iterative learning that characterizes state-of-the-art SNN training is incompatible with the physical nature and real-time operation of neuromorphic hardware. Bi-level learning, such as meta-learning is increasingly used in deep learning to overcome these limitations. In this work, we demonstrate gradient-based meta-learning in SNNs using the surrogate gradient method that approximates the spiking threshold function for gradient estimations. Because surrogate gradients can be made twice differentiable, well-established, and effective second-order gradient meta-learning methods such as Model Agnostic Meta Learning (MAML) can be used. We show that SNNs meta-trained using MAML match or exceed the performance of conventional ANNs meta-trained with MAML on event-based meta-datasets. Furthermore, we demonstrate the specific advantages that accrue from meta-learning: fast learning without the requirement of high precision weights or gradients. Our results emphasize how meta-learning techniques can become instrumental for deploying neuromorphic learning technologies on real-world problems.
    Understanding and Compressing Music with Maximal Transformable Patterns. (arXiv:2201.11085v1 [cs.LG])
    (0 min) We present a polynomial-time algorithm that discovers all maximal patterns in a point set, $D\in\mathbb{R}^k$, that are related by transformations in a user-specified class, $F$, of bijections over $\mathbb{R}^k$. We also present a second algorithm that discovers the set of occurrences for each of these maximal patterns and then uses compact encodings of these occurrence sets to compute a losslessly compressed encoding of the input point set. This encoding takes the form of a set of pairs, $E=\left\lbrace\left\langle P_1, T_1\right\rangle,\left\langle P_2, T_2\right\rangle,\ldots\left\langle P_{\ell}, T_{\ell}\right\rangle\right\rbrace$, where each $\langle P_i,T_i\rangle$ consists of a maximal pattern, $P_i\subseteq D$, and a set, $T_i\subset F$, of transformations that map $P_i$ onto other subsets of $D$. Each transformation is encoded by a vector of real values that uniquely identifies it within $F$ and the length of this vector is used as a measure of the complexity of $F$. We evaluate the new compression algorithm with three transformation classes of differing complexity, on the task of classifying folk-song melodies into tune families. The most complex of the classes tested includes all combinations of the musical transformations of transposition, inversion, retrograde, augmentation and diminution. We found that broadening the transformation class improved performance on this task. However, it did not, on average, improve compression factor, which may be due to the datasets (in this case, folk-song melodies) being too short and simple to benefit from the potentially greater number of pattern relationships that are discoverable with larger transformation classes.
    Exploiting Semantic Epsilon Greedy Exploration Strategy in Multi-Agent Reinforcement Learning. (arXiv:2201.10803v1 [cs.LG])
    (0 min) Multi-agent reinforcement learning (MARL) can model many real world applications. However, many MARL approaches rely on epsilon greedy for exploration, which may discourage visiting advantageous states in hard scenarios. In this paper, we propose a new approach QMIX(SEG) for tackling MARL. It makes use of the value function factorization method QMIX to train per-agent policies and a novel Semantic Epsilon Greedy (SEG) exploration strategy. SEG is a simple extension to the conventional epsilon greedy exploration strategy, yet it is experimentally shown to greatly improve the performance of MARL. We first cluster actions into groups of actions with similar effects and then use the groups in a bi-level epsilon greedy exploration hierarchy for action selection. We argue that SEG facilitates semantic exploration by exploring in the space of groups of actions, which have richer semantic meanings than atomic actions. Experiments show that QMIX(SEG) largely outperforms QMIX and leads to strong performance competitive with current state-of-the-art MARL approaches on the StarCraft Multi-Agent Challenge (SMAC) benchmark.
    LaughNet: synthesizing laughter utterances from waveform silhouettes and a single laughter example. (arXiv:2110.04946v2 [cs.SD] UPDATED)
    (0 min) Emotional and controllable speech synthesis is a topic that has received much attention. However, most studies focused on improving the expressiveness and controllability in the context of linguistic content, even though natural verbal human communication is inseparable from spontaneous non-speech expressions such as laughter, crying, or grunting. We propose a model called LaughNet for synthesizing laughter by using waveform silhouettes as inputs. The motivation is not simply synthesizing new laughter utterances, but testing a novel synthesis-control paradigm that uses an abstract representation of the waveform. We conducted basic listening test experiments, and the results showed that LaughNet can synthesize laughter utterances with moderate quality and retain the characteristics of the training example. More importantly, the generated waveforms have shapes similar to the input silhouettes. For future work, we will test the same method on other types of human nonverbal expressions and integrate it into more elaborated synthesis systems.
    Enel: Context-Aware Dynamic Scaling of Distributed Dataflow Jobs using Graph Propagation. (arXiv:2108.12211v3 [cs.DC] UPDATED)
    (0 min) Distributed dataflow systems like Spark and Flink enable the use of clusters for scalable data analytics. While runtime prediction models can be used to initially select appropriate cluster resources given target runtimes, the actual runtime performance of dataflow jobs depends on several factors and varies over time. Yet, in many situations, dynamic scaling can be used to meet formulated runtime targets despite significant performance variance. This paper presents Enel, a novel dynamic scaling approach that uses message propagation on an attributed graph to model dataflow jobs and, thus, allows for deriving effective rescaling decisions. For this, Enel incorporates descriptive properties that capture the respective execution context, considers statistics from individual dataflow tasks, and propagates predictions through the job graph to eventually find an optimized new scale-out. Our evaluation of Enel with four iterative Spark jobs shows that our approach is able to identify effective rescaling actions, reacting for instance to node failures, and can be reused across different execution contexts.
    Invertible Voice Conversion. (arXiv:2201.10687v1 [eess.AS])
    (0 min) In this paper, we propose an invertible deep learning framework called INVVC for voice conversion. It is designed against the possible threats that inherently come along with voice conversion systems. Specifically, we develop an invertible framework that makes the source identity traceable. The framework is built on a series of invertible $1\times1$ convolutions and flows consisting of affine coupling layers. We apply the proposed framework to one-to-one voice conversion and many-to-one conversion using parallel training data. Experimental results show that this approach yields impressive performance on voice conversion and, moreover, the converted results can be reversed back to the source inputs utilizing the same parameters as in forwarding.
    Competition over data: how does data purchase affect users?. (arXiv:2201.10774v1 [cs.LG])
    (0 min) As machine learning (ML) is deployed by many competing service providers, the underlying ML predictors also compete against each other, and it is increasingly important to understand the impacts and biases from such competition. In this paper, we study what happens when the competing predictors can acquire additional labeled data to improve their prediction quality. We introduce a new environment that allows ML predictors to use active learning algorithms to purchase labeled data within their budgets while competing against each other to attract users. Our environment models a critical aspect of data acquisition in competing systems which has not been well-studied before. We found that the overall performance of an ML predictor improves when predictors can purchase additional labeled data. Surprisingly, however, the quality that users experience -- i.e. the accuracy of the predictor selected by each user -- can decrease even as the individual predictors get better. We show that this phenomenon naturally arises due to a trade-off whereby competition pushes each predictor to specialize in a subset of the population while data purchase has the effect of making predictors more uniform. We support our findings with both experiments and theories.
    Graph Neural Networks for Charged Particle Tracking on FPGAs. (arXiv:2112.02048v2 [physics.ins-det] UPDATED)
    (0 min) The determination of charged particle trajectories in collisions at the CERN Large Hadron Collider (LHC) is an important but challenging problem, especially in the high interaction density conditions expected during the future high-luminosity phase of the LHC (HL-LHC). Graph neural networks (GNNs) are a type of geometric deep learning algorithm that has successfully been applied to this task by embedding tracker data as a graph -- nodes represent hits, while edges represent possible track segments -- and classifying the edges as true or fake track segments. However, their study in hardware- or software-based trigger applications has been limited due to their large computational cost. In this paper, we introduce an automated translation workflow, integrated into a broader tool called $\texttt{hls4ml}$, for converting GNNs into firmware for field-programmable gate arrays (FPGAs). We use this translation tool to implement GNNs for charged particle tracking, trained using the TrackML challenge dataset, on FPGAs with designs targeting different graph sizes, task complexites, and latency/throughput requirements. This work could enable the inclusion of charged particle tracking GNNs at the trigger level for HL-LHC experiments.
    Multiaccurate Proxies for Downstream Fairness. (arXiv:2107.04423v2 [cs.LG] UPDATED)
    (0 min) We study the problem of training a model that must obey demographic fairness conditions when the sensitive features are not available at training time -- in other words, how can we train a model to be fair by race when we don't have data about race? We adopt a fairness pipeline perspective, in which an "upstream" learner that does have access to the sensitive features will learn a proxy model for these features from the other attributes. The goal of the proxy is to allow a general "downstream" learner -- with minimal assumptions on their prediction task -- to be able to use the proxy to train a model that is fair with respect to the true sensitive features. We show that obeying multiaccuracy constraints with respect to the downstream model class suffices for this purpose, provide sample- and oracle efficient-algorithms and generalization bounds for learning such proxies, and conduct an experimental evaluation. In general, multiaccuracy is much easier to satisfy than classification accuracy, and can be satisfied even when the sensitive features are hard to predict.
    Adaptive Resonance Theory-based Topological Clustering with a Divisive Hierarchical Structure Capable of Continual Learning. (arXiv:2201.10713v1 [cs.LG])
    (0 min) Thanks to an ability for handling the plasticity-stability dilemma, Adaptive Resonance Theory (ART) is considered as an effective approach for realizing continual learning. In general, however, the clustering performance of ART-based algorithms strongly depends on a similarity threshold, i.e., a vigilance parameter, which is data-dependent and specified by hand. This paper proposes an ART-based topological clustering algorithm with a mechanism that automatically estimates a similarity threshold from a distribution of data points. In addition, for the improving information extraction performance, a divisive hierarchical clustering algorithm capable of continual learning is proposed by introducing a hierarchical structure to the proposed algorithm. Simulation experiments show that the proposed algorithm shows the comparative clustering performance compared with recently proposed hierarchical clustering algorithms.
    Probabilistic Surrogate Networks for Simulators with Unbounded Randomness. (arXiv:1910.11950v2 [cs.LG] UPDATED)
    (0 min) We present a framework for automatically structuring and training fast, approximate, deep neural surrogates of stochastic simulators. Unlike traditional approaches to surrogate modeling, our surrogates retain the interpretable structure and control flow of the reference simulator. Our surrogates target stochastic simulators where the number of random variables itself can be stochastic and potentially unbounded. Our framework further enables an automatic replacement of the reference simulator with the surrogate when undertaking amortized inference. The fidelity and speed of our surrogates allow for both faster stochastic simulation and accurate and substantially faster posterior inference. Using an illustrative yet non-trivial example we show our surrogates' ability to accurately model a probabilistic program with an unbounded number of random variables. We then proceed with an example that shows our surrogates are able to accurately model a complex structure like an unbounded stack in a program synthesis example. We further demonstrate how our surrogate modeling technique makes amortized inference in complex black-box simulators an order of magnitude faster. Specifically, we do simulator-based materials quality testing, inferring safety-critical latent internal temperature profiles of composite materials undergoing curing.
    Mis-spoke or mis-lead: Achieving Robustness in Multi-Agent Communicative Reinforcement Learning. (arXiv:2108.03803v2 [cs.LG] UPDATED)
    (0 min) Recent studies in multi-agent communicative reinforcement learning (MACRL) have demonstrated that multi-agent coordination can be greatly improved by allowing communication between agents. Meanwhile, adversarial machine learning (ML) has shown that ML models are vulnerable to attacks. Despite the increasing concern about the robustness of ML algorithms, how to achieve robust communication in multi-agent reinforcement learning has been largely neglected. In this paper, we systematically explore the problem of adversarial communication in MACRL. Our main contributions are threefold. First, we propose an effective method to perform attacks in MACRL, by learning a model to generate optimal malicious messages. Second, we develop a defence method based on message reconstruction, to maintain multi-agent coordination under message attacks. Third, we formulate the adversarial communication problem as a two-player zero-sum game and propose a game-theoretical method R-MACRL to improve the worst-case defending performance. Empirical results demonstrate that many state-of-the-art MACRL methods are vulnerable to message attacks, and our method can significantly improve their robustness.
    Combining optimal path search with task-dependent learning in a neural network. (arXiv:2201.11104v1 [cs.LG])
    (0 min) Finding optimal paths in connected graphs requires determining the smallest total cost for traveling along the graph's edges. This problem can be solved by several classical algorithms where, usually, costs are predefined for all edges. Conventional planning methods can, thus, normally not be used when wanting to change costs in an adaptive way following the requirements of some task. Here we show that one can define a neural network representation of path finding problems by transforming cost values into synaptic weights, which allows for online weight adaptation using network learning mechanisms. When starting with an initial activity value of one, activity propagation in this network will lead to solutions, which are identical to those found by the Bellman Ford algorithm. The neural network has the same algorithmic complexity as Bellman Ford and, in addition, we can show that network learning mechanisms (such as Hebbian learning) can adapt the weights in the network augmenting the resulting paths according to some task at hand. We demonstrate this by learning to navigate in an environment with obstacles as well as by learning to follow certain sequences of path nodes. Hence, the here-presented novel algorithm may open up a different regime of applications where path-augmentation (by learning) is directly coupled with path finding in a natural way.
    A probabilistic latent variable model for detecting structure in binary data. (arXiv:2201.11108v1 [stat.ML])
    (0 min) We introduce a novel, probabilistic binary latent variable model to detect noisy or approximate repeats of patterns in sparse binary data. The model is based on the "Noisy-OR model" (Heckerman, 1990), used previously for disease and topic modelling. The model's capability is demonstrated by extracting structure in recordings from retinal neurons, but it can be widely applied to discover and model latent structure in noisy binary data. In the context of spiking neural data, the task is to "explain" spikes of individual neurons in terms of groups of neurons, "Cell Assemblies" (CAs), that often fire together, due to mutual interactions or other causes. The model infers sparse activity in a set of binary latent variables, each describing the activity of a cell assembly. When the latent variable of a cell assembly is active, it reduces the probabilities of neurons belonging to this assembly to be inactive. The conditional probability kernels of the latent components are learned from the data in an expectation maximization scheme, involving inference of latent states and parameter adjustments to the model. We thoroughly validate the model on synthesized spike trains constructed to statistically resemble recorded retinal responses to white noise stimulus and natural movie stimulus in data. We also apply our model to spiking responses recorded in retinal ganglion cells (RGCs) during stimulation with a movie and discuss the found structure.
    S$^2$NN: Time Step Reduction of Spiking Surrogate Gradients for Training Energy Efficient Single-Step Neural Networks. (arXiv:2201.10879v1 [cs.LG])
    (0 min) As the scales of neural networks increase, techniques that enable them to run with low computational cost and energy efficiency are required. From such demands, various efficient neural network paradigms, such as spiking neural networks (SNNs) or binary neural networks (BNNs), have been proposed. However, they have sticky drawbacks, such as degraded inference accuracy and latency. To solve these problems, we propose a single-step neural network (S$^2$NN), an energy-efficient neural network with low computational cost and high precision. The proposed S$^2$NN processes the information between hidden layers by spikes as SNNs. Nevertheless, it has no temporal dimension so that there is no latency within training and inference phases as BNNs. Thus, the proposed S$^2$NN has a lower computational cost than SNNs that require time-series processing. However, S$^2$NN cannot adopt na\"{i}ve backpropagation algorithms due to the non-differentiability nature of spikes. We deduce a suitable neuron model by reducing the surrogate gradient for multi-time step SNNs to a single-time step. We experimentally demonstrated that the obtained neuron model enables S$^2$NN to train more accurately and energy-efficiently than existing neuron models for SNNs and BNNs. We also showed that the proposed S$^2$NN could achieve comparable accuracy to full-precision networks while being highly energy-efficient.
    Improving robustness and calibration in ensembles with diversity regularization. (arXiv:2201.10908v1 [cs.LG])
    (0 min) Calibration and uncertainty estimation are crucial topics in high-risk environments. We introduce a new diversity regularizer for classification tasks that uses out-of-distribution samples and increases the overall accuracy, calibration and out-of-distribution detection capabilities of ensembles. Following the recent interest in the diversity of ensembles, we systematically evaluate the viability of explicitly regularizing ensemble diversity to improve calibration on in-distribution data as well as under dataset shift. We demonstrate that diversity regularization is highly beneficial in architectures, where weights are partially shared between the individual members and even allows to use fewer ensemble members to reach the same level of robustness. Experiments on CIFAR-10, CIFAR-100, and SVHN show that regularizing diversity can have a significant impact on calibration and robustness, as well as out-of-distribution detection.
    PySINDy: A comprehensive Python package for robust sparse system identification. (arXiv:2111.08481v2 [eess.SY] UPDATED)
    (0 min) Automated data-driven modeling, the process of directly discovering the governing equations of a system from data, is increasingly being used across the scientific community. PySINDy is a Python package that provides tools for applying the sparse identification of nonlinear dynamics (SINDy) approach to data-driven model discovery. In this major update to PySINDy, we implement several advanced features that enable the discovery of more general differential equations from noisy and limited data. The library of candidate terms is extended for the identification of actuated systems, partial differential equations (PDEs), and implicit differential equations. Robust formulations, including the integral form of SINDy and ensembling techniques, are also implemented to improve performance for real-world data. Finally, we provide a range of new optimization algorithms, including several sparse regression techniques and algorithms to enforce and promote inequality constraints and stability. Together, these updates enable entirely new SINDy model discovery capabilities that have not been reported in the literature, such as constrained PDE identification and ensembling with different sparse regression optimizers.
    Robust Implicit Networks via Non-Euclidean Contractions. (arXiv:2106.03194v6 [cs.LG] UPDATED)
    (0 min) Implicit neural networks, a.k.a., deep equilibrium networks, are a class of implicit-depth learning models where function evaluation is performed by solving a fixed point equation. They generalize classic feedforward models and are equivalent to infinite-depth weight-tied feedforward networks. While implicit models show improved accuracy and significant reduction in memory consumption, they can suffer from ill-posedness and convergence instability. This paper provides a new framework, which we call Non-Euclidean Monotone Operator Network (NEMON), to design well-posed and robust implicit neural networks based upon contraction theory for the non-Euclidean norm $\ell_{\infty}$. Our framework includes (i) a novel condition for well-posedness based on one-sided Lipschitz constants, (ii) an average iteration for computing fixed-points, and (iii) explicit estimates on input-output Lipschitz constants. Additionally, we design a training problem with the well-posedness condition and the average iteration as constraints and, to achieve robust models, with the input-output Lipschitz constant as a regularizer. Our $\ell_{\infty}$ well-posedness condition leads to a larger polytopic training search space than existing conditions and our average iteration enjoys accelerated convergence. Finally, we evaluate our framework in image classification through the MNIST and the CIFAR-10 datasets. Our numerical results demonstrate improved accuracy and robustness of the implicit models with smaller input-output Lipschitz bounds. Code is available at https://github.com/davydovalexander/Non-Euclidean_Mon_Op_Net.
    Using a Novel COVID-19 Calculator to Measure Positive U.S. Socio-Economic Impact of a COVID-19 Pre-Screening Solution (AI/ML). (arXiv:2201.11109v1 [cs.AI])
    (0 min) The COVID-19 pandemic has been a scourge upon humanity, claiming the lives of more than 5.1 million people worldwide; the global economy contracted by 3.5% in 2020. This paper presents a COVID-19 calculator, synthesizing existing published calculators and data points, to measure the positive U.S. socio-economic impact of a COVID-19 AI/ML pre-screening solution (algorithm & application).
    Natural Language Descriptions of Deep Visual Features. (arXiv:2201.11114v1 [cs.CV])
    (0 min) Some neurons in deep networks specialize in recognizing highly specific perceptual, structural, or semantic features of inputs. In computer vision, techniques exist for identifying neurons that respond to individual concept categories like colors, textures, and object classes. But these techniques are limited in scope, labeling only a small subset of neurons and behaviors in any network. Is a richer characterization of neuron-level computation possible? We introduce a procedure (called MILAN, for mutual-information-guided linguistic annotation of neurons) that automatically labels neurons with open-ended, compositional, natural language descriptions. Given a neuron, MILAN generates a description by searching for a natural language string that maximizes pointwise mutual information with the image regions in which the neuron is active. MILAN produces fine-grained descriptions that capture categorical, relational, and logical structure in learned features. These descriptions obtain high agreement with human-generated feature descriptions across a diverse set of model architectures and tasks, and can aid in understanding and controlling learned models. We highlight three applications of natural language neuron descriptions. First, we use MILAN for analysis, characterizing the distribution and importance of neurons selective for attribute, category, and relational information in vision models. Second, we use MILAN for auditing, surfacing neurons sensitive to protected categories like race and gender in models trained on datasets intended to obscure these features. Finally, we use MILAN for editing, improving robustness in an image classifier by deleting neurons sensitive to text features spuriously correlated with class labels.
    Generalization in Supervised Learning Through Riemannian Contraction. (arXiv:2201.06656v2 [cs.LG] UPDATED)
    (0 min) We prove that Riemannian contraction in a supervised learning setting implies generalization. Specifically, we show that if an optimizer is contracting in some Riemannian metric with rate $\lambda > 0$, it is uniformly algorithmically stable with rate $\mathcal{O}(1/\lambda n)$, where $n$ is the number of labelled examples in the training set. The results hold for stochastic and deterministic optimization, in both continuous and discrete-time, for convex and non-convex loss surfaces. The associated generalization bounds reduce to well-known results in the particular case of gradient descent over convex or strongly convex loss surfaces. They can be shown to be optimal in certain linear settings, such as kernel ridge regression under gradient flow.
    Three approaches to facilitate DNN generalization to objects in out-of-distribution orientations and illuminations. (arXiv:2111.00131v2 [cs.CV] UPDATED)
    (3 min) The training data distribution is often biased towards objects in certain orientations and illumination conditions. While humans have a remarkable capability of recognizing objects in out-of-distribution (OoD) orientations and illuminations, Deep Neural Networks (DNNs) severely suffer in this case, even when large amounts of training examples are available. In this paper, we investigate three different approaches to improve DNNs in recognizing objects in OoD orientations and illuminations. Namely, these are (i) training much longer after convergence of the in-distribution (InD) validation accuracy, i.e., late-stopping, (ii) tuning the momentum parameter of the batch normalization layers, and (iii) enforcing invariance of the neural activity in an intermediate layer to orientation and illumination conditions. Each of these approaches substantially improves the DNN's OoD accuracy (more than 20% in some cases). We report results in four datasets: two datasets are modified from the MNIST and iLab datasets, and the other two are novel (one of 3D rendered cars and another of objects taken from various controlled orientations and illumination conditions). These datasets allow to study the effects of different amounts of bias and are challenging as DNNs perform poorly in OoD conditions. Finally, we demonstrate that even though the three approaches focus on different aspects of DNNs, they all tend to lead to the same underlying neural mechanism to enable OoD accuracy gains --individual neurons in the intermediate layers become more selective to a category and also invariant to OoD orientations and illuminations. We anticipate this study to be a basis for further improvement of deep neural networks' OoD generalization performance, which is highly demanded to achieve safe and fair AI applications.
    Self-Directed Online Machine Learning for Topology Optimization. (arXiv:2002.01927v8 [cs.CE] UPDATED)
    (2 min) Topology optimization by optimally distributing materials in a given domain requires non-gradient optimizers to solve highly complicated problems. However, with hundreds of design variables or more involved, solving such problems would require millions of Finite Element Method (FEM) calculations whose computational cost is huge and impractical. Here we report Self-directed Online Learning Optimization (SOLO) which integrates Deep Neural Network (DNN) with FEM calculations. A DNN learns and substitutes the objective as a function of design variables. A small number of training data is generated dynamically based on the DNN's prediction of the optimum. The DNN adapts to the new training data and gives better prediction in the region of interest until convergence. The optimum predicted by the DNN is proved to converge to the true global optimum through iterations. Our algorithm was tested by four types of problems including compliance minimization, fluid-structure optimization, heat transfer enhancement and truss optimization. It reduced the computational time by 2 ~ 5 orders of magnitude compared with directly using heuristic methods, and outperformed all state-of-the-art algorithms tested in our experiments. This approach enables solving large multi-dimensional optimization problems.
    Finite-Sample Analysis of Nonlinear Stochastic Approximation with Applications in Reinforcement Learning. (arXiv:1905.11425v7 [math.OC] UPDATED)
    (2 min) Motivated by applications in reinforcement learning (RL), we study a nonlinear stochastic approximation (SA) algorithm under Markovian noise, and establish its finite-sample convergence bounds under various stepsizes. Specifically, we show that when using constant stepsize (i.e., $\alpha_k\equiv \alpha$), the algorithm achieves exponential fast convergence to a neighborhood (with radius $O(\alpha\log(1/\alpha))$) around the desired limit point. When using diminishing stepsizes with appropriate decay rate, the algorithm converges with rate $O(\log(k)/k)$. Our proof is based on Lyapunov drift arguments, and to handle the Markovian noise, we exploit the fast mixing of the underlying Markov chain. To demonstrate the generality of our theoretical results on Markovian SA, we use it to derive the finite-sample bounds of the popular $Q$-learning with linear function approximation algorithm, under a condition on the behavior policy. Importantly, we do not need to make the assumption that the samples are i.i.d., and do not require an artificial projection step in the algorithm to maintain the boundedness of the iterates. Numerical simulations corroborate our theoretical results.
    Exploiting Hybrid Models of Tensor-Train Networks for Spoken Command Recognition. (arXiv:2201.10609v1 [cs.SD])
    (2 min) This work aims to design a low complexity spoken command recognition (SCR) system by considering different trade-offs between the number of model parameters and classification accuracy. More specifically, we exploit a deep hybrid architecture of a tensor-train (TT) network to build an end-to-end SRC pipeline. Our command recognition system, namely CNN+(TT-DNN), is composed of convolutional layers at the bottom for spectral feature extraction and TT layers at the top for command classification. Compared with a traditional end-to-end CNN baseline for SCR, our proposed CNN+(TT-DNN) model replaces fully connected (FC) layers with TT ones and it can substantially reduce the number of model parameters while maintaining the baseline performance of the CNN model. We initialize the CNN+(TT-DNN) model in a randomized manner or based on a well-trained CNN+DNN, and assess the CNN+(TT-DNN) models on the Google Speech Command Dataset. Our experimental results show that the proposed CNN+(TT-DNN) model attains a competitive accuracy of 96.31% with 4 times fewer model parameters than the CNN model. Furthermore, the CNN+(TT-DNN) model can obtain a 97.2% accuracy when the number of parameters is increased.
    An Automated Question-Answering Framework Based on Evolution Algorithm. (arXiv:2201.10797v1 [cs.CL])
    (2 min) Building a deep learning model for a Question-Answering (QA) task requires a lot of human effort, it may need several months to carefully tune various model architectures and find a best one. It's even harder to find different excellent models for multiple datasets. Recent works show that the best model structure is related to the dataset used, and one single model cannot adapt to all tasks. In this paper, we propose an automated Question-Answering framework, which could automatically adjust network architecture for multiple datasets. Our framework is based on an innovative evolution algorithm, which is stable and suitable for multiple dataset scenario. The evolution algorithm for search combine prior knowledge into initial population and use a performance estimator to avoid inefficient mutation by predicting the performance of candidate model architecture. The prior knowledge used in initial population could improve the final result of the evolution algorithm. The performance estimator could quickly filter out models with bad performance in population as the number of trials increases, to speed up the convergence. Our framework achieves 78.9 EM and 86.1 F1 on SQuAD 1.1, 69.9 EM and 72.5 F1 on SQuAD 2.0. On NewsQA dataset, the found model achieves 47.0 EM and 62.9 F1.
    Physics-informed ConvNet: Learning Physical Field from a Shallow Neural Network. (arXiv:2201.10967v1 [cs.LG])
    (2 min) Big-data-based artificial intelligence (AI) supports profound evolution in almost all of science and technology. However, modeling and forecasting multi-physical systems remain a challenge due to unavoidable data scarcity and noise. Improving the generalization ability of neural networks by "teaching" domain knowledge and developing a new generation of models combined with the physical laws have become promising areas of machine learning research. Different from "deep" fully-connected neural networks embedded with physical information (PINN), a novel shallow framework named physics-informed convolutional network (PICN) is recommended from a CNN perspective, in which the physical field is generated by a deconvolution layer and a single convolution layer. The difference fields forming the physical operator are constructed using the pre-trained shallow convolution layer. An efficient linear interpolation network calculates the loss function involving boundary conditions and the physical constraints in irregular geometry domains. The effectiveness of the current development is illustrated through some numerical cases involving the solving (and estimation) of nonlinear physical operator equations and recovering physical information from noisy observations. Its potential advantage in approximating physical fields with multi-frequency components indicates that PICN may become an alternative neural network solver in physics-informed machine learning.
    Server-Side Stepsizes and Sampling Without Replacement Provably Help in Federated Optimization. (arXiv:2201.11066v1 [cs.LG])
    (2 min) We present a theoretical study of server-side optimization in federated learning. Our results are the first to show that the widely popular heuristic of scaling the client updates with an extra parameter is very useful in the context of Federated Averaging (FedAvg) with local passes over the client data. Each local pass is performed without replacement using Random Reshuffling, which is a key reason we can show improved complexities. In particular, we prove that whenever the local stepsizes are small, and the update direction is given by FedAvg in conjunction with Random Reshuffling over all clients, one can take a big leap in the obtained direction and improve rates for convex, strongly convex, and non-convex objectives. In particular, in non-convex regime we get an enhancement of the rate of convergence from $\mathcal{O}\left(\varepsilon^{-3}\right)$ to $\mathcal{O}\left(\varepsilon^{-2}\right)$. This result is new even for Random Reshuffling performed on a single node. In contrast, if the local stepsizes are large, we prove that the noise of client sampling can be controlled by using a small server-side stepsize. To the best of our knowledge, this is the first time that local steps provably help to overcome the communication bottleneck. Together, our results on the advantage of large and small server-side stepsizes give a formal justification for the practice of adaptive server-side optimization in federated learning. Moreover, we consider a variant of our algorithm that supports partial client participation, which makes the method more practical.
    Confidence-Aware Imitation Learning from Demonstrations with Varying Optimality. (arXiv:2110.14754v2 [cs.LG] UPDATED)
    (2 min) Most existing imitation learning approaches assume the demonstrations are drawn from experts who are optimal, but relaxing this assumption enables us to use a wider range of data. Standard imitation learning may learn a suboptimal policy from demonstrations with varying optimality. Prior works use confidence scores or rankings to capture beneficial information from demonstrations with varying optimality, but they suffer from many limitations, e.g., manually annotated confidence scores or high average optimality of demonstrations. In this paper, we propose a general framework to learn from demonstrations with varying optimality that jointly learns the confidence score and a well-performing policy. Our approach, Confidence-Aware Imitation Learning (CAIL) learns a well-performing policy from confidence-reweighted demonstrations, while using an outer loss to track the performance of our model and to learn the confidence. We provide theoretical guarantees on the convergence of CAIL and evaluate its performance in both simulated and real robot experiments. Our results show that CAIL significantly outperforms other imitation learning methods from demonstrations with varying optimality. We further show that even without access to any optimal demonstrations, CAIL can still learn a successful policy, and outperforms prior work.
    Post-training Quantization for Neural Networks with Provable Guarantees. (arXiv:2201.11113v1 [cs.LG])
    (2 min) While neural networks have been remarkably successful in a wide array of applications, implementing them in resource-constrained hardware remains an area of intense research. By replacing the weights of a neural network with quantized (e.g., 4-bit, or binary) counterparts, massive savings in computation cost, memory, and power consumption are attained. We modify a post-training neural-network quantization method, GPFQ, that is based on a greedy path-following mechanism, and rigorously analyze its error. We prove that for quantizing a single-layer network, the relative square error essentially decays linearly in the number of weights -- i.e., level of over-parametrization. Our result holds across a range of input distributions and for both fully-connected and convolutional architectures. To empirically evaluate the method, we quantize several common architectures with few bits per weight, and test them on ImageNet, showing only minor loss of accuracy. We also demonstrate that standard modifications, such as bias correction and mixed precision quantization, further improve accuracy.
    Autoregressive neural-network wavefunctions for ab initio quantum chemistry. (arXiv:2109.12606v2 [physics.chem-ph] UPDATED)
    (2 min) In recent years, neural network quantum states (NNQS) have emerged as powerful tools for the study of quantum many-body systems. Electronic structure calculations are one such canonical many-body problem that have attracted significant research efforts spanning multiple decades, whilst only recently being attempted with NNQS. However, the complex non-local interactions and high sample complexity are significant challenges that call for bespoke solutions. Here, we parameterise the electronic wavefunction with a novel autoregressive neural network (ARN) that permits highly efficient and scalable sampling, whilst also embedding physical priors reflecting the structure of molecular systems without sacrificing expressibility. This allows us to perform electronic structure calculations on molecules with up to 30 spin-orbitals -- at least an order of magnitude more Slater determinants than previous applications of conventional NNQS -- and we find that our ansatz can outperform the de-facto gold-standard coupled cluster methods even in the presence of strong quantum correlations. With a highly expressive neural network for which sampling is no longer a computational bottleneck, we conclude that the barriers to further scaling are not associated with the wavefunction ansatz itself, but rather are inherent to any variational Monte Carlo approach.
    Enabling Deep Learning on Edge Devices through Filter Pruning and Knowledge Transfer. (arXiv:2201.10947v1 [cs.LG])
    (2 min) Deep learning models have introduced various intelligent applications to edge devices, such as image classification, speech recognition, and augmented reality. There is an increasing need of training such models on the devices in order to deliver personalized, responsive, and private learning. To address this need, this paper presents a new solution for deploying and training state-of-the-art models on the resource-constrained devices. First, the paper proposes a novel filter-pruning-based model compression method to create lightweight trainable models from large models trained in the cloud, without much loss of accuracy. Second, it proposes a novel knowledge transfer method to enable the on-device model to update incrementally in real time or near real time using incremental learning on new data and enable the on-device model to learn the unseen categories with the help of the in-cloud model in an unsupervised fashion. The results show that 1) our model compression method can remove up to 99.36% parameters of WRN-28-10, while preserving a Top-1 accuracy of over 90% on CIFAR-10; 2) our knowledge transfer method enables the compressed models to achieve more than 90% accuracy on CIFAR-10 and retain good accuracy on old categories; 3) it allows the compressed models to converge within real time (three to six minutes) on the edge for incremental learning tasks; 4) it enables the model to classify unseen categories of data (78.92% Top-1 accuracy) that it is never trained with.
    Personalized Federated Learning with Moreau Envelopes. (arXiv:2006.08848v3 [cs.LG] UPDATED)
    (2 min) Federated learning (FL) is a decentralized and privacy-preserving machine learning technique in which a group of clients collaborate with a server to learn a global model without sharing clients' data. One challenge associated with FL is statistical diversity among clients, which restricts the global model from delivering good performance on each client's task. To address this, we propose an algorithm for personalized FL (pFedMe) using Moreau envelopes as clients' regularized loss functions, which help decouple personalized model optimization from the global model learning in a bi-level problem stylized for personalized FL. Theoretically, we show that pFedMe's convergence rate is state-of-the-art: achieving quadratic speedup for strongly convex and sublinear speedup of order 2/3 for smooth nonconvex objectives. Experimentally, we verify that pFedMe excels at empirical performance compared with the vanilla FedAvg and Per-FedAvg, a meta-learning based personalized FL algorithm.
    Alleviating Cold-start Problem in CTR Prediction with A Variational Embedding Learning Framework. (arXiv:2201.10980v1 [cs.IR])
    (2 min) We propose a general Variational Embedding Learning Framework (VELF) for alleviating the severe cold-start problem in CTR prediction. VELF addresses the cold start problem via alleviating over-fits caused by data-sparsity in two ways: learning probabilistic embedding, and incorporating trainable and regularized priors which utilize the rich side information of cold start users and advertisements (Ads). The two techniques are naturally integrated into a variational inference framework, forming an end-to-end training process. Abundant empirical tests on benchmark datasets well demonstrate the advantages of our proposed VELF. Besides, extended experiments confirmed that our parameterized and regularized priors provide more generalization capability than traditional fixed priors.
    Hyperparameter Optimization for COVID-19 Chest X-Ray Classification. (arXiv:2201.10885v1 [eess.IV])
    (2 min) Despite the introduction of vaccines, Coronavirus disease (COVID-19) remains a worldwide dilemma, continuously developing new variants such as Delta and the recent Omicron. The current standard for testing is through polymerase chain reaction (PCR). However, PCRs can be expensive, slow, and/or inaccessible to many people. X-rays on the other hand have been readily used since the early 20th century and are relatively cheaper, quicker to obtain, and typically covered by health insurance. With a careful selection of model, hyperparameters, and augmentations, we show that it is possible to develop models with 83% accuracy in binary classification and 64% in multi-class for detecting COVID-19 infections from chest x-rays.
    Visualizing the diversity of representations learned by Bayesian neural networks. (arXiv:2201.10859v1 [cs.LG])
    (2 min) Explainable artificial intelligence (XAI) aims to make learning machines less opaque, and offers researchers and practitioners various tools to reveal the decision-making strategies of neural networks. In this work, we investigate how XAI methods can be used for exploring and visualizing the diversity of feature representations learned by Bayesian neural networks (BNNs). Our goal is to provide a global understanding of BNNs by making their decision-making strategies a) visible and tangible through feature visualizations and b) quantitatively measurable with a distance measure learned by contrastive learning. Our work provides new insights into the posterior distribution in terms of human-understandable feature information with regard to the underlying decision-making strategies. Our main findings are the following: 1) global XAI methods can be applied to explain the diversity of decision-making strategies of BNN instances, 2) Monte Carlo dropout exhibits increased diversity in feature representations compared to the multimodal posterior approximation of MultiSWAG, 3) the diversity of learned feature representations highly correlates with the uncertainty estimates, and 4) the inter-mode diversity of the multimodal posterior decreases as the network width increases, while the intra-mode diversity increases. Our findings are consistent with the recent deep neural networks theory, providing additional intuitions about what the theory implies in terms of humanly understandable concepts.
    An Efficient and Robust System for Vertically Federated Random Forest. (arXiv:2201.10761v1 [cs.LG])
    (2 min) As there is a growing interest in utilizing data across multiple resources to build better machine learning models, many vertically federated learning algorithms have been proposed to preserve the data privacy of the participating organizations. However, the efficiency of existing vertically federated learning algorithms remains to be a big problem, especially when applied to large-scale real-world datasets. In this paper, we present a fast, accurate, scalable and yet robust system for vertically federated random forest. With extensive optimization, we achieved $5\times$ and $83\times$ speed up over the SOTA SecureBoost model \cite{cheng2019secureboost} for training and serving tasks. Moreover, the proposed system can achieve similar accuracy but with favorable scalability and partition tolerance. Our code has been made public to facilitate the development of the community and the protection of user data privacy.
    On the Power of Gradual Network Alignment Using Dual-Perception Similarities. (arXiv:2201.10945v1 [cs.SI])
    (2 min) Network alignment (NA) is the task of finding the correspondence of nodes between two networks based on the network structure and node attributes. Our study is motivated by the fact that, since most of existing NA methods have attempted to discover all node pairs at once, they do not harness information enriched through interim discovery of node correspondences to more accurately find the next correspondences during the node matching. To tackle this challenge, we propose Grad-Align, a new NA method that gradually discovers node pairs by making full use of node pairs exhibiting strong consistency, which are easy to be discovered in the early stage of gradual matching. Specifically, Grad-Align first generates node embeddings of the two networks based on graph neural networks along with our layer-wise reconstruction loss, a loss built upon capturing the first-order and higher-order neighborhood structures. Then, nodes are gradually aligned by computing dual-perception similarity measures including the multi-layer embedding similarity as well as the Tversky similarity, an asymmetric set similarity using the Tversky index applicable to networks with different scales. Additionally, we incorporate an edge augmentation module into Grad-Align to reinforce the structural consistency. Through comprehensive experiments using real-world and synthetic datasets, we empirically demonstrate that Grad-Align consistently outperforms state-of-the-art NA methods.
    Speeding up Heterogeneous Federated Learning with Sequentially Trained Superclients. (arXiv:2201.10899v1 [cs.LG])
    (2 min) Federated Learning (FL) allows training machine learning models in privacy-constrained scenarios by enabling the cooperation of edge devices without requiring local data sharing. This approach raises several challenges due to the different statistical distribution of the local datasets and the clients' computational heterogeneity. In particular, the presence of highly non-i.i.d. data severely impairs both the performance of the trained neural network and its convergence rate, increasing the number of communication rounds requested to reach a performance comparable to that of the centralized scenario. As a solution, we propose FedSeq, a novel framework leveraging the sequential training of subgroups of heterogeneous clients, i.e. superclients, to emulate the centralized paradigm in a privacy-compliant way. Given a fixed budget of communication rounds, we show that FedSeq outperforms or match several state-of-the-art federated algorithms in terms of final performance and speed of convergence. Finally, our method can be easily integrated with other approaches available in the literature. Empirical results show that combining existing algorithms with FedSeq further improves its final performance and convergence speed. We test our method on CIFAR-10 and CIFAR-100 and prove its effectiveness in both i.i.d. and non-i.i.d. scenarios.
    A deep learning method based on patchwise training for reconstructing temperature field. (arXiv:2201.10860v1 [cs.LG])
    (2 min) Physical field reconstruction is highly desirable for the measurement and control of engineering systems. The reconstruction of the temperature field from limited observation plays a crucial role in thermal management for electronic equipment. Deep learning has been employed in physical field reconstruction, whereas the accurate estimation for the regions with large gradients is still diffcult. To solve the problem, this work proposes a novel deep learning method based on patchwise training to reconstruct the temperature field of electronic equipment accurately from limited observation. Firstly, the temperature field reconstruction (TFR) problem of the electronic equipment is modeled mathematically and transformed as an image-to-image regression task. Then a patchwise training and inference framework consisting of an adaptive UNet and a shallow multilayer perceptron (MLP) is developed to establish the mapping from the observation to the temperature field. The adaptive UNet is utilized to reconstruct the whole temperature field while the MLP is designed to predict the patches with large temperature gradients. Experiments employing finite element simulation data are conducted to demonstrate the accuracy of the proposed method. Furthermore, the generalization is evaluated by investigating cases under different heat source layouts, different power intensities, and different observation point locations. The maximum absolute errors of the reconstructed temperature field are less than 1K under the patchwise training approach.
    Sparsity Regularization For Cold-Start Recommendation. (arXiv:2201.10711v1 [cs.IR])
    (2 min) Recently, Generative Adversarial Networks (GANs) have been applied to the problem of Cold-Start Recommendation, but the training performance of these models is hampered by the extreme sparsity in warm user purchase behavior. In this paper we introduce a novel representation for user-vectors by combining user demographics and user preferences, making the model a hybrid system which uses Collaborative Filtering and Content Based Recommendation. Our system models user purchase behavior using weighted user-product preferences (explicit feedback) rather than binary user-product interactions (implicit feedback). Using this we develop a novel sparse adversarial model, SRLGAN, for Cold-Start Recommendation leveraging the sparse user-purchase behavior which ensures training stability and avoids over-fitting on warm users. We evaluate the SRLGAN on two popular datasets and demonstrate state-of-the-art results.
    Image Generation with Self Pixel-wise Normalization. (arXiv:2201.10725v1 [cs.CV])
    (2 min) Region-adaptive normalization (RAN) methods have been widely used in the generative adversarial network (GAN)-based image-to-image translation technique. However, since these approaches need a mask image to infer the pixel-wise affine transformation parameters, they cannot be applied to the general image generation models having no paired mask images. To resolve this problem, this paper presents a novel normalization method, called self pixel-wise normalization (SPN), which effectively boosts the generative performance by performing the pixel-adaptive affine transformation without the mask image. In our method, the transforming parameters are derived from a self-latent mask that divides the feature map into the foreground and background regions. The visualization of the self-latent masks shows that SPN effectively captures a single object to be generated as the foreground. Since the proposed method produces the self-latent mask without external data, it is easily applicable in the existing generative models. Extensive experiments on various datasets reveal that the proposed method significantly improves the performance of image generation technique in terms of Frechet inception distance (FID) and Inception score (IS).
    One Student Knows All Experts Know: From Sparse to Dense. (arXiv:2201.10890v1 [cs.LG])
    (2 min) Human education system trains one student by multiple experts. Mixture-of-experts (MoE) is a powerful sparse architecture including multiple experts. However, sparse MoE model is hard to implement, easy to overfit, and not hardware-friendly. In this work, inspired by human education model, we propose a novel task, knowledge integration, to obtain a dense student model (OneS) as knowledgeable as one sparse MoE. We investigate this task by proposing a general training framework including knowledge gathering and knowledge distillation. Specifically, we first propose Singular Value Decomposition Knowledge Gathering (SVD-KG) to gather key knowledge from different pretrained experts. We then refine the dense student model by knowledge distillation to offset the noise from gathering. On ImageNet, our OneS preserves $61.7\%$ benefits from MoE. OneS can achieve $78.4\%$ top-1 accuracy with only $15$M parameters. On four natural language processing datasets, OneS obtains $88.2\%$ MoE benefits and outperforms SoTA by $51.7\%$ using the same architecture and training data. In addition, compared with the MoE counterpart, OneS can achieve $3.7 \times$ inference speedup due to the hardware-friendly architecture.
    Dual-Tasks Siamese Transformer Framework for Building Damage Assessment. (arXiv:2201.10953v1 [cs.CV])
    (2 min) Accurate and fine-grained information about the extent of damage to buildings is essential for humanitarian relief and disaster response. However, as the most commonly used architecture in remote sensing interpretation tasks, Convolutional Neural Networks (CNNs) have limited ability to model the non-local relationship between pixels. Recently, Transformer architecture first proposed for modeling long-range dependency in natural language processing has shown promising results in computer vision tasks. Considering the frontier advances of Transformer architecture in the computer vision field, in this paper, we present the first attempt at designing a Transformer-based damage assessment architecture (DamFormer). In DamFormer, a siamese Transformer encoder is first constructed to extract non-local and representative deep features from input multitemporal image-pairs. Then, a multitemporal fusion module is designed to fuse information for downstream tasks. Finally, a lightweight dual-tasks decoder aggregates multi-level features for final prediction. To the best of our knowledge, it is the first time that such a deep Transformer-based network is proposed for multitemporal remote sensing interpretation tasks. The experimental results on the large-scale damage assessment dataset xBD demonstrate the potential of the Transformer-based architecture.
    Class-Aware Generative Adversarial Transformers for Medical Image Segmentation. (arXiv:2201.10737v1 [cs.CV])
    (2 min) Transformers have made remarkable progress towards modeling long-range dependencies within the medical image analysis domain. However, current transformer-based models suffer from several disadvantages: 1) existing methods fail to capture the important features of the images due to the naive tokenization scheme; 2) the models suffer from information loss because they only consider single-scale feature representations; and 3) the segmentation label maps generated by the models are not accurate enough without considering rich semantic contexts and anatomical textures. In this work, we present CA-GANformer, a novel type of generative adversarial transformers, for medical image segmentation. First, we take advantage of the pyramid structure to construct multi-scale representations and handle multi-scale variations. We then design a novel class-aware transformer module to better learn the discriminative regions of objects with semantic structures. Lastly, we utilize an adversarial training strategy that boosts segmentation accuracy and correspondingly allows a transformer-based discriminator to capture high-level semantically correlated contents and low-level anatomical features. Our experiments demonstrate that CA-GANformer dramatically outperforms previous state-of-the-art transformer-based approaches on three benchmarks, obtaining absolute 2.54%-5.88% improvements in Dice over previous models. Further qualitative experiments provide a more detailed picture of the model's inner workings, shed light on the challenges in improved transparency, and demonstrate that transfer learning can greatly improve performance and reduce the size of medical image datasets in training, making CA-GANformer a strong starting point for downstream medical image analysis tasks. Codes and models will be available to the public.
    A Kernel Learning Method for Backward SDE Filter. (arXiv:2201.10600v1 [math.NA])
    (2 min) In this paper, we develop a kernel learning backward SDE filter method to estimate the state of a stochastic dynamical system based on its partial noisy observations. A system of forward backward stochastic differential equations is used to propagate the state of the target dynamical model, and Bayesian inference is applied to incorporate the observational information. To characterize the dynamical model in the entire state space, we introduce a kernel learning method to learn a continuous global approximation for the conditional probability density function of the target state by using discrete approximated density values as training data. Numerical experiments demonstrate that the kernel learning backward SDE is highly effective and highly efficient.
    Attentive Task Interaction Network for Multi-Task Learning. (arXiv:2201.10649v1 [cs.CV])
    (2 min) Multitask learning (MTL) has recently gained a lot of popularity as a learning paradigm that can lead to improved per-task performance while also using fewer per-task model parameters compared to single task learning. One of the biggest challenges regarding MTL networks involves how to share features across tasks. To address this challenge, we propose the Attentive Task Interaction Network (ATI-Net). ATI-Net employs knowledge distillation of the latent features for each task, then combines the feature maps to provide improved contextualized information to the decoder. This novel approach to introducing knowledge distillation into an attention based multitask network outperforms state of the art MTL baselines such as the standalone MTAN and PAD-Net, with roughly the same number of model parameters.
    Variational Model Inversion Attacks. (arXiv:2201.10787v1 [cs.LG])
    (2 min) Given the ubiquity of deep neural networks, it is important that these models do not reveal information about sensitive data that they have been trained on. In model inversion attacks, a malicious user attempts to recover the private dataset used to train a supervised neural network. A successful model inversion attack should generate realistic and diverse samples that accurately describe each of the classes in the private dataset. In this work, we provide a probabilistic interpretation of model inversion attacks, and formulate a variational objective that accounts for both diversity and accuracy. In order to optimize this variational objective, we choose a variational family defined in the code space of a deep generative model, trained on a public auxiliary dataset that shares some structural similarity with the target dataset. Empirically, our method substantially improves performance in terms of target attack accuracy, sample realism, and diversity on datasets of faces and chest X-ray images.
    Phishing Attacks Detection -- A Machine Learning-Based Approach. (arXiv:2201.10752v1 [cs.CR])
    (2 min) Phishing attacks are one of the most common social engineering attacks targeting users emails to fraudulently steal confidential and sensitive information. They can be used as a part of more massive attacks launched to gain a foothold in corporate or government networks. Over the last decade, a number of anti-phishing techniques have been proposed to detect and mitigate these attacks. However, they are still inefficient and inaccurate. Thus, there is a great need for efficient and accurate detection techniques to cope with these attacks. In this paper, we proposed a phishing attack detection technique based on machine learning. We collected and analyzed more than 4000 phishing emails targeting the email service of the University of North Dakota. We modeled these attacks by selecting 10 relevant features and building a large dataset. This dataset was used to train, validate, and test the machine learning algorithms. For performance evaluation, four metrics have been used, namely probability of detection, probability of miss-detection, probability of false alarm, and accuracy. The experimental results show that better detection can be achieved using an artificial neural network.
    Learning Multiple Probabilistic Degradation Generators for Unsupervised Real World Image Super Resolution. (arXiv:2201.10747v1 [eess.IV])
    (2 min) Unsupervised real world super resolution (USR) aims at restoring high-resolution (HR) images given low-resolution (LR) inputs when paired data is unavailable. One of the most common approaches is synthesizing noisy LR images using GANs and utilizing a synthetic dataset to train the model in a supervised manner. The goal of modeling the degradation generator is to approximate the distribution of LR images given a HR image. Previous works simply assumed the conditional distribution as a delta function and learned the deterministic mapping from HR image to a LR image. Instead, we propose the probabilistic degradation generator. Our degradation generator is a deep hierarchical latent variable model and more suitable for modeling the complex distribution. Furthermore, we train multiple degradation generators to enhance the mode coverage and apply the novel collaborative learning. We outperform several baselines on benchmark datasets in terms of PSNR and SSIM and demonstrate the robustness of our method on unseen data distribution.
    Towards Sharp Stochastic Zeroth Order Hessian Estimators over Riemannian Manifolds. (arXiv:2201.10780v1 [stat.ML])
    (2 min) We study Hessian estimators for real-valued functions defined over an $n$-dimensional complete Riemannian manifold. We introduce new stochastic zeroth-order Hessian estimators using $O (1)$ function evaluations. We show that, for a smooth real-valued function $f$ with Lipschitz Hessian (with respect to the Rimannian metric), our estimator achieves a bias bound of order $ O \left( L_2 \delta + \gamma \delta^2 \right) $, where $ L_2 $ is the Lipschitz constant for the Hessian, $ \gamma $ depends on both the Levi-Civita connection and function $f$, and $\delta$ is the finite difference step size. To the best of our knowledge, our results provide the first bias bound for Hessian estimators that explicitly depends on the geometry of the underlying Riemannian manifold. Perhaps more importantly, our bias bound does not increase with dimension $n$. This improves best previously known bias bound for $O(1)$-evaluation Hessian estimators, which increases quadratically with $n$. We also study downstream computations based on our Hessian estimators. The supremacy of our method is evidenced by empirical evaluations.
    Extending compositional data analysis from a graph signal processing perspective. (arXiv:2201.10610v1 [stat.ME])
    (2 min) Traditional methods for the analysis of compositional data consider the log-ratios between all different pairs of variables with equal weight, typically in the form of aggregated contributions. This is not meaningful in contexts where it is known that a relationship only exists between very specific variables (e.g.~for metabolomic pathways), while for other pairs a relationship does not exist. Modeling absence or presence of relationships is done in graph theory, where the vertices represent the variables, and the connections refer to relations. This paper links compositional data analysis with graph signal processing, and it extends the Aitchison geometry to a setting where only selected log-ratios can be considered. The presented framework retains the desirable properties of scale invariance and compositional coherence. An additional extension to include absolute information is readily made. Examples from bioinformatics and geochemistry underline the usefulness of thisapproach in comparison to standard methods for compositional data analysis.
  • stat.ML updates on arXiv.org

    Explicit construction of the minimum error variance estimator for stochastic LTI state-space systems. (arXiv:2109.02384v2 [math.OC] UPDATED)
    (2 min) In this short article, we showcase the derivation of the optimal (minimum error variance) estimator, when one part of the stochastic LTI system output is not measured but is able to be predicted from the measured system outputs. Similar derivations have been done before but not using state-space representation.
    ReLU Regression with Massart Noise. (arXiv:2109.04623v2 [cs.LG] UPDATED)
    (2 min) We study the fundamental problem of ReLU regression, where the goal is to fit Rectified Linear Units (ReLUs) to data. This supervised learning task is efficiently solvable in the realizable setting, but is known to be computationally hard with adversarial label noise. In this work, we focus on ReLU regression in the Massart noise model, a natural and well-studied semi-random noise model. In this model, the label of every point is generated according to a function in the class, but an adversary is allowed to change this value arbitrarily with some probability, which is {\em at most} $\eta < 1/2$. We develop an efficient algorithm that achieves exact parameter recovery in this model under mild anti-concentration assumptions on the underlying distribution. Such assumptions are necessary for exact recovery to be information-theoretically possible. We demonstrate that our algorithm significantly outperforms naive applications of $\ell_1$ and $\ell_2$ regression on both synthetic and real data.
    Learnable Wavelet Packet Transform for Data-Adapted Spectrograms. (arXiv:2201.11069v1 [cs.SD])
    (0 min) Capturing high-frequency data concerning the condition of complex systems, e.g. by acoustic monitoring, has become increasingly prevalent. Such high-frequency signals typically contain time dependencies ranging over different time scales and different types of cyclic behaviors. Processing such signals requires careful feature engineering, particularly the extraction of meaningful time-frequency features. This can be time-consuming and the performance is often dependent on the choice of parameters. To address these limitations, we propose a deep learning framework for learnable wavelet packet transforms, enabling to learn features automatically from data and optimise them with respect to the defined objective function. The learned features can be represented as a spectrogram, containing the important time-frequency information of the dataset. We evaluate the properties and performance of the proposed approach by evaluating its improved spectral leakage and by applying it to an anomaly detection task for acoustic monitoring.
    Extending compositional data analysis from a graph signal processing perspective. (arXiv:2201.10610v1 [stat.ME])
    (0 min) Traditional methods for the analysis of compositional data consider the log-ratios between all different pairs of variables with equal weight, typically in the form of aggregated contributions. This is not meaningful in contexts where it is known that a relationship only exists between very specific variables (e.g.~for metabolomic pathways), while for other pairs a relationship does not exist. Modeling absence or presence of relationships is done in graph theory, where the vertices represent the variables, and the connections refer to relations. This paper links compositional data analysis with graph signal processing, and it extends the Aitchison geometry to a setting where only selected log-ratios can be considered. The presented framework retains the desirable properties of scale invariance and compositional coherence. An additional extension to include absolute information is readily made. Examples from bioinformatics and geochemistry underline the usefulness of thisapproach in comparison to standard methods for compositional data analysis.
    Concentration bounds for the empirical angular measure with statistical learning applications. (arXiv:2104.03966v2 [math.ST] UPDATED)
    (2 min) The angular measure on the unit sphere characterizes the first-order dependence structure of the components of a random vector in extreme regions and is defined in terms of standardized margins. Its statistical recovery is an important step in learning problems involving observations far away from the center. In the common situation that the components of the vector have different distributions, the rank transformation offers a convenient and robust way of standardizing data in order to build an empirical version of the angular measure based on the most extreme observations. However, the study of the sampling distribution of the resulting empirical angular measure is challenging. It is the purpose of the paper to establish finite-sample bounds for the maximal deviations between the empirical and true angular measures, uniformly over classes of Borel sets of controlled combinatorial complexity. The bounds are valid with high probability and, up to logarithmic factors, scale as the square root of the effective sample size. The bounds are applied to provide performance guarantees for two statistical learning procedures tailored to extreme regions of the input space and built upon the empirical angular measure: binary classification in extreme regions through empirical risk minimization and unsupervised anomaly detection through minimum-volume sets of the sphere.
    Splintering with distributions: A stochastic decoy scheme for private computation. (arXiv:2007.02719v3 [cs.LG] UPDATED)
    (2 min) Performing computations while maintaining privacy is an important problem in todays distributed machine learning solutions. Consider the following two set ups between a client and a server, where in setup i) the client has a public data vector $\mathbf{x}$, the server has a large private database of data vectors $\mathcal{B}$ and the client wants to find the inner products $\langle \mathbf{x,y_k} \rangle, \forall \mathbf{y_k} \in \mathcal{B}$. The client does not want the server to learn $\mathbf{x}$ while the server does not want the client to learn the records in its database. This is in contrast to another setup ii) where the client would like to perform an operation solely on its data, such as computation of a matrix inverse on its data matrix $\mathbf{M}$, but would like to use the superior computing ability of the server to do so without having to leak $\mathbf{M}$ to the server. \par We present a stochastic scheme for splitting the client data into privatized shares that are transmitted to the server in such settings. The server performs the requested operations on these shares instead of on the raw client data at the server. The obtained intermediate results are sent back to the client where they are assembled by the client to obtain the final result.
    Hardness of Random Optimization Problems for Boolean Circuits, Low-Degree Polynomials, and Langevin Dynamics. (arXiv:2004.12063v2 [cs.CC] UPDATED)
    (3 min) We consider the problem of finding nearly optimal solutions of optimization problems with random objective functions. Two concrete problems we consider are (a) optimizing the Hamiltonian of a spherical or Ising $p$-spin glass model, and (b) finding a large independent set in a sparse Erd\H{o}s-R\'{e}nyi graph. The following families of algorithms are considered: (a) low-degree polynomials of the input; (b) low-depth Boolean circuits; (c) the Langevin dynamics algorithm. We show that these families of algorithms fail to produce nearly optimal solutions with high probability. For the case of Boolean circuits, our results improve the state-of-the-art bounds known in circuit complexity theory (although we consider the search problem as opposed to the decision problem). Our proof uses the fact that these models are known to exhibit a variant of the overlap gap property (OGP) of near-optimal solutions. Specifically, for both models, every two solutions whose objectives are above a certain threshold are either close or far from each other. The crux of our proof is that the classes of algorithms we consider exhibit a form of stability. We show by an interpolation argument that stable algorithms cannot overcome the OGP barrier. The stability of Langevin dynamics is an immediate consequence of the well-posedness of stochastic differential equations. The stability of low-degree polynomials and Boolean circuits is established using tools from Gaussian and Boolean analysis -- namely hypercontractivity and total influence, as well as a novel lower bound for random walks avoiding certain subsets. In the case of Boolean circuits, the result also makes use of Linal-Mansour-Nisan's classical theorem. Our techniques apply more broadly to low influence functions and may apply more generally.
    Bayesian Learning via Neural Schr\"odinger-F\"ollmer Flows. (arXiv:2111.10510v6 [stat.ML] UPDATED)
    (2 min) In this work we explore a new framework for approximate Bayesian inference in large datasets based on stochastic control. We advocate stochastic control as a finite time and low variance alternative to popular steady-state methods such as stochastic gradient Langevin dynamics (SGLD). Furthermore, we discuss and adapt the existing theoretical guarantees of this framework and establish connections to already existing VI routines in SDE-based models.
    A Kernel-Based Approach for Modelling Gaussian Processes with Functional Information. (arXiv:2201.11023v1 [math.ST])
    (2 min) Gaussian processes are among the most useful tools in modeling continuous processes in machine learning and statistics. If the value of a process is known at a finite collection of points, one may use Gaussian processes to construct a surface which interpolates these values to be used for prediction and uncertainty quantification in other locations. However, it is not always the case that the available information is in the form of a finite collection of points. For example, boundary value problems contain information on the boundary of a domain, which is an uncountable collection of points that cannot be incorporated into typical Gaussian process techniques. In this paper we construct a Gaussian process model which utilizes reproducing kernel Hilbert spaces to unify the typical finite case with the case of having uncountable information by exploiting the equivalence of conditional expectation and orthogonal projections. We discuss this construction in statistical models, including numerical considerations and a proof of concept.
    Explaining Hyperparameter Optimization via Partial Dependence Plots. (arXiv:2111.04820v2 [cs.LG] UPDATED)
    (2 min) Automated hyperparameter optimization (HPO) can support practitioners to obtain peak performance in machine learning models. However, there is often a lack of valuable insights into the effects of different hyperparameters on the final model performance. This lack of explainability makes it difficult to trust and understand the automated HPO process and its results. We suggest using interpretable machine learning (IML) to gain insights from the experimental data obtained during HPO with Bayesian optimization (BO). BO tends to focus on promising regions with potential high-performance configurations and thus induces a sampling bias. Hence, many IML techniques, such as the partial dependence plot (PDP), carry the risk of generating biased interpretations. By leveraging the posterior uncertainty of the BO surrogate model, we introduce a variant of the PDP with estimated confidence bands. We propose to partition the hyperparameter space to obtain more confident and reliable PDPs in relevant sub-regions. In an experimental study, we provide quantitative evidence for the increased quality of the PDPs within sub-regions.
    Visualizing the diversity of representations learned by Bayesian neural networks. (arXiv:2201.10859v1 [cs.LG])
    (2 min) Explainable artificial intelligence (XAI) aims to make learning machines less opaque, and offers researchers and practitioners various tools to reveal the decision-making strategies of neural networks. In this work, we investigate how XAI methods can be used for exploring and visualizing the diversity of feature representations learned by Bayesian neural networks (BNNs). Our goal is to provide a global understanding of BNNs by making their decision-making strategies a) visible and tangible through feature visualizations and b) quantitatively measurable with a distance measure learned by contrastive learning. Our work provides new insights into the posterior distribution in terms of human-understandable feature information with regard to the underlying decision-making strategies. Our main findings are the following: 1) global XAI methods can be applied to explain the diversity of decision-making strategies of BNN instances, 2) Monte Carlo dropout exhibits increased diversity in feature representations compared to the multimodal posterior approximation of MultiSWAG, 3) the diversity of learned feature representations highly correlates with the uncertainty estimates, and 4) the inter-mode diversity of the multimodal posterior decreases as the network width increases, while the intra-mode diversity increases. Our findings are consistent with the recent deep neural networks theory, providing additional intuitions about what the theory implies in terms of humanly understandable concepts.
    Physics-informed ConvNet: Learning Physical Field from a Shallow Neural Network. (arXiv:2201.10967v1 [cs.LG])
    (2 min) Big-data-based artificial intelligence (AI) supports profound evolution in almost all of science and technology. However, modeling and forecasting multi-physical systems remain a challenge due to unavoidable data scarcity and noise. Improving the generalization ability of neural networks by "teaching" domain knowledge and developing a new generation of models combined with the physical laws have become promising areas of machine learning research. Different from "deep" fully-connected neural networks embedded with physical information (PINN), a novel shallow framework named physics-informed convolutional network (PICN) is recommended from a CNN perspective, in which the physical field is generated by a deconvolution layer and a single convolution layer. The difference fields forming the physical operator are constructed using the pre-trained shallow convolution layer. An efficient linear interpolation network calculates the loss function involving boundary conditions and the physical constraints in irregular geometry domains. The effectiveness of the current development is illustrated through some numerical cases involving the solving (and estimation) of nonlinear physical operator equations and recovering physical information from noisy observations. Its potential advantage in approximating physical fields with multi-frequency components indicates that PICN may become an alternative neural network solver in physics-informed machine learning.
    Graph Neural Networks for Charged Particle Tracking on FPGAs. (arXiv:2112.02048v2 [physics.ins-det] UPDATED)
    (2 min) The determination of charged particle trajectories in collisions at the CERN Large Hadron Collider (LHC) is an important but challenging problem, especially in the high interaction density conditions expected during the future high-luminosity phase of the LHC (HL-LHC). Graph neural networks (GNNs) are a type of geometric deep learning algorithm that has successfully been applied to this task by embedding tracker data as a graph -- nodes represent hits, while edges represent possible track segments -- and classifying the edges as true or fake track segments. However, their study in hardware- or software-based trigger applications has been limited due to their large computational cost. In this paper, we introduce an automated translation workflow, integrated into a broader tool called $\texttt{hls4ml}$, for converting GNNs into firmware for field-programmable gate arrays (FPGAs). We use this translation tool to implement GNNs for charged particle tracking, trained using the TrackML challenge dataset, on FPGAs with designs targeting different graph sizes, task complexites, and latency/throughput requirements. This work could enable the inclusion of charged particle tracking GNNs at the trigger level for HL-LHC experiments.
    A probabilistic latent variable model for detecting structure in binary data. (arXiv:2201.11108v1 [stat.ML])
    (2 min) We introduce a novel, probabilistic binary latent variable model to detect noisy or approximate repeats of patterns in sparse binary data. The model is based on the "Noisy-OR model" (Heckerman, 1990), used previously for disease and topic modelling. The model's capability is demonstrated by extracting structure in recordings from retinal neurons, but it can be widely applied to discover and model latent structure in noisy binary data. In the context of spiking neural data, the task is to "explain" spikes of individual neurons in terms of groups of neurons, "Cell Assemblies" (CAs), that often fire together, due to mutual interactions or other causes. The model infers sparse activity in a set of binary latent variables, each describing the activity of a cell assembly. When the latent variable of a cell assembly is active, it reduces the probabilities of neurons belonging to this assembly to be inactive. The conditional probability kernels of the latent components are learned from the data in an expectation maximization scheme, involving inference of latent states and parameter adjustments to the model. We thoroughly validate the model on synthesized spike trains constructed to statistically resemble recorded retinal responses to white noise stimulus and natural movie stimulus in data. We also apply our model to spiking responses recorded in retinal ganglion cells (RGCs) during stimulation with a movie and discuss the found structure.
    Probabilistic Surrogate Networks for Simulators with Unbounded Randomness. (arXiv:1910.11950v2 [cs.LG] UPDATED)
    (2 min) We present a framework for automatically structuring and training fast, approximate, deep neural surrogates of stochastic simulators. Unlike traditional approaches to surrogate modeling, our surrogates retain the interpretable structure and control flow of the reference simulator. Our surrogates target stochastic simulators where the number of random variables itself can be stochastic and potentially unbounded. Our framework further enables an automatic replacement of the reference simulator with the surrogate when undertaking amortized inference. The fidelity and speed of our surrogates allow for both faster stochastic simulation and accurate and substantially faster posterior inference. Using an illustrative yet non-trivial example we show our surrogates' ability to accurately model a probabilistic program with an unbounded number of random variables. We then proceed with an example that shows our surrogates are able to accurately model a complex structure like an unbounded stack in a program synthesis example. We further demonstrate how our surrogate modeling technique makes amortized inference in complex black-box simulators an order of magnitude faster. Specifically, we do simulator-based materials quality testing, inferring safety-critical latent internal temperature profiles of composite materials undergoing curing.
    Robust Implicit Networks via Non-Euclidean Contractions. (arXiv:2106.03194v6 [cs.LG] UPDATED)
    (2 min) Implicit neural networks, a.k.a., deep equilibrium networks, are a class of implicit-depth learning models where function evaluation is performed by solving a fixed point equation. They generalize classic feedforward models and are equivalent to infinite-depth weight-tied feedforward networks. While implicit models show improved accuracy and significant reduction in memory consumption, they can suffer from ill-posedness and convergence instability. This paper provides a new framework, which we call Non-Euclidean Monotone Operator Network (NEMON), to design well-posed and robust implicit neural networks based upon contraction theory for the non-Euclidean norm $\ell_{\infty}$. Our framework includes (i) a novel condition for well-posedness based on one-sided Lipschitz constants, (ii) an average iteration for computing fixed-points, and (iii) explicit estimates on input-output Lipschitz constants. Additionally, we design a training problem with the well-posedness condition and the average iteration as constraints and, to achieve robust models, with the input-output Lipschitz constant as a regularizer. Our $\ell_{\infty}$ well-posedness condition leads to a larger polytopic training search space than existing conditions and our average iteration enjoys accelerated convergence. Finally, we evaluate our framework in image classification through the MNIST and the CIFAR-10 datasets. Our numerical results demonstrate improved accuracy and robustness of the implicit models with smaller input-output Lipschitz bounds. Code is available at https://github.com/davydovalexander/Non-Euclidean_Mon_Op_Net.
    The edge of chaos: quantum field theory and deep neural networks. (arXiv:2109.13247v2 [hep-th] UPDATED)
    (2 min) We explicitly construct the quantum field theory corresponding to a general class of deep neural networks encompassing both recurrent and feedforward architectures. We first consider the mean-field theory (MFT) obtained as the leading saddlepoint in the action, and derive the condition for criticality via the largest Lyapunov exponent. We then compute the loop corrections to the correlation function in a perturbative expansion in the ratio of depth $T$ to width $N$, and find a precise analogy with the well-studied $O(N)$ vector model, in which the variance of the weight initializations plays the role of the 't Hooft coupling. In particular, we compute both the $\mathcal{O}(1)$ corrections quantifying fluctuations from typicality in the ensemble of networks, and the subleading $\mathcal{O}(T/N)$ corrections due to finite-width effects. These provide corrections to the correlation length that controls the depth to which information can propagate through the network, and thereby sets the scale at which such networks are trainable by gradient descent. Our analysis provides a first-principles approach to the rapidly emerging NN-QFT correspondence, and opens several interesting avenues to the study of criticality in deep neural networks.
    Variational multiple shooting for Bayesian ODEs with Gaussian processes. (arXiv:2106.10905v2 [cs.LG] UPDATED)
    (2 min) Recent machine learning advances have proposed black-box estimation of unknown continuous-time system dynamics directly from data. However, earlier works are based on approximative ODE solutions or point estimates. We propose a novel Bayesian nonparametric model that uses Gaussian processes to infer posteriors of unknown ODE systems directly from data. We derive sparse variational inference with decoupled functional sampling to represent vector field posteriors. We also introduce a probabilistic shooting augmentation to enable efficient inference from arbitrarily long trajectories. The method demonstrates the benefit of computing vector field posteriors, with predictive uncertainty scores outperforming alternative methods on multiple ODE learning tasks.
    Deviation inequalities for stochastic approximation by averaging. (arXiv:2102.08685v2 [math.PR] UPDATED)
    (2 min) We introduce a class of Markov chains, that contains the model of stochastic approximation by averaging and non-averaging. Using martingale approximation method, we establish various deviation inequalities for separately Lipschitz functions of such a chain, with different moment conditions on some dominating random variables of martingale differences.Finally, we apply these inequalities to the stochastic approximation by averaging and empirical risk minimisation.
    System identification using Bayesian neural networks with nonparametric noise models. (arXiv:2104.12119v3 [stat.ME] UPDATED)
    (2 min) System identification is of special interest in science and engineering. This article is concerned with a system identification problem arising in stochastic dynamic systems, where the aim is to estimate the parameters of a system along with its unknown noise processes. In particular, we propose a Bayesian nonparametric approach for system identification in discrete time nonlinear random dynamical systems assuming only the order of the Markov process is known. The proposed method replaces the assumption of Gaussian distributed error components with a highly flexible family of probability density functions based on Bayesian nonparametric priors. Additionally, the functional form of the system is estimated by leveraging Bayesian neural networks which also leads to flexible uncertainty quantification. Asymptotically on the number of hidden neurons, the proposed model converges to full nonparametric Bayesian regression model. A Gibbs sampler for posterior inference is proposed and its effectiveness is illustrated on simulated and real time series.
    Post-training Quantization for Neural Networks with Provable Guarantees. (arXiv:2201.11113v1 [cs.LG])
    (2 min) While neural networks have been remarkably successful in a wide array of applications, implementing them in resource-constrained hardware remains an area of intense research. By replacing the weights of a neural network with quantized (e.g., 4-bit, or binary) counterparts, massive savings in computation cost, memory, and power consumption are attained. We modify a post-training neural-network quantization method, GPFQ, that is based on a greedy path-following mechanism, and rigorously analyze its error. We prove that for quantizing a single-layer network, the relative square error essentially decays linearly in the number of weights -- i.e., level of over-parametrization. Our result holds across a range of input distributions and for both fully-connected and convolutional architectures. To empirically evaluate the method, we quantize several common architectures with few bits per weight, and test them on ImageNet, showing only minor loss of accuracy. We also demonstrate that standard modifications, such as bias correction and mixed precision quantization, further improve accuracy.
    Extreme events evaluation using CRPS distributions. (arXiv:1905.04022v2 [stat.ME] UPDATED)
    (2 min) Verification of probabilistic forecasts for extreme events has been a very active field of research, stirred by media and public opinions who naturally focus their attention on extreme events, and easily draw biased onclusions. In this context, classical verification methodologies tailored for extreme events, such as thresholded and weighted scoring rules, have undesirable properties that cannot be mitigated; the well-known Continuous Ranked Probability Score (CRPS) makes no exception. In this paper, we define a formal framework to assess the behavior of forecast evaluation procedures with respect to extreme events, that we use to point out that assessment based on the expectation of a proper score is not suitable for extremes. As an alternative, we propose to study the properties of the CRPS as a random variable using extreme value theory to address extreme events verification. To compare calibrated forecasts, an index is introduced that summarizes the ability of probabilistic forecasts to predict extremes. Its strengths and limitations are discussed using both theoretical arguments and simulations.
    Confidence intervals for the Cox model test error from cross-validation. (arXiv:2201.10770v1 [stat.ME])
    (2 min) Cross-validation (CV) is one of the most widely used techniques in statistical learning for estimating the test error of a model, but its behavior is not yet fully understood. It has been shown that standard confidence intervals for test error using estimates from CV may have coverage below nominal levels. This phenomenon occurs because each sample is used in both the training and testing procedures during CV and as a result, the CV estimates of the errors become correlated. Without accounting for this correlation, the estimate of the variance is smaller than it should be. One way to mitigate this issue is by estimating the mean squared error of the prediction error instead using nested CV. This approach has been shown to achieve superior coverage compared to intervals derived from standard CV. In this work, we generalize the nested CV idea to the Cox proportional hazards model and explore various choices of test error for this setting.
    Generalization Error Bounds on Deep Learning with Markov Datasets. (arXiv:2201.11059v1 [stat.ML])
    (2 min) In this paper, we derive upper bounds on generalization errors for deep neural networks with Markov datasets. These bounds are developed based on Koltchinskii and Panchenko's approach for bounding the generalization error of combined classifiers with i.i.d. datasets. The development of new symmetrization inequalities in high-dimensional probability for Markov chains is a key element in our extension, where the pseudo-spectral gap of the infinitesimal generator of the Markov chain plays as a key parameter in these inequalities. We also propose a simple method to convert these bounds and other similar bounds on traditional deep learning and machine learning to Bayesian counterparts for both i.i.d. and Markov datasets.
    Self-Directed Online Machine Learning for Topology Optimization. (arXiv:2002.01927v8 [cs.CE] UPDATED)
    (2 min) Topology optimization by optimally distributing materials in a given domain requires non-gradient optimizers to solve highly complicated problems. However, with hundreds of design variables or more involved, solving such problems would require millions of Finite Element Method (FEM) calculations whose computational cost is huge and impractical. Here we report Self-directed Online Learning Optimization (SOLO) which integrates Deep Neural Network (DNN) with FEM calculations. A DNN learns and substitutes the objective as a function of design variables. A small number of training data is generated dynamically based on the DNN's prediction of the optimum. The DNN adapts to the new training data and gives better prediction in the region of interest until convergence. The optimum predicted by the DNN is proved to converge to the true global optimum through iterations. Our algorithm was tested by four types of problems including compliance minimization, fluid-structure optimization, heat transfer enhancement and truss optimization. It reduced the computational time by 2 ~ 5 orders of magnitude compared with directly using heuristic methods, and outperformed all state-of-the-art algorithms tested in our experiments. This approach enables solving large multi-dimensional optimization problems.
    Personalized Federated Learning with Moreau Envelopes. (arXiv:2006.08848v3 [cs.LG] UPDATED)
    (2 min) Federated learning (FL) is a decentralized and privacy-preserving machine learning technique in which a group of clients collaborate with a server to learn a global model without sharing clients' data. One challenge associated with FL is statistical diversity among clients, which restricts the global model from delivering good performance on each client's task. To address this, we propose an algorithm for personalized FL (pFedMe) using Moreau envelopes as clients' regularized loss functions, which help decouple personalized model optimization from the global model learning in a bi-level problem stylized for personalized FL. Theoretically, we show that pFedMe's convergence rate is state-of-the-art: achieving quadratic speedup for strongly convex and sublinear speedup of order 2/3 for smooth nonconvex objectives. Experimentally, we verify that pFedMe excels at empirical performance compared with the vanilla FedAvg and Per-FedAvg, a meta-learning based personalized FL algorithm.
    Auto-Compressing Subset Pruning for Semantic Image Segmentation. (arXiv:2201.11103v1 [cs.CV])
    (2 min) State-of-the-art semantic segmentation models are characterized by high parameter counts and slow inference times, making them unsuitable for deployment in resource-constrained environments. To address this challenge, we propose \textsc{Auto-Compressing Subset Pruning}, \acosp, as a new online compression method. The core of \acosp consists of learning a channel selection mechanism for individual channels of each convolution in the segmentation model based on an effective temperature annealing schedule. We show a crucial interplay between providing a high-capacity model at the beginning of training and the compression pressure forcing the model to compress concepts into retained channels. We apply \acosp to \segnet and \pspnet architectures and show its success when trained on the \camvid, \city, \voc, and \ade datasets. The results are competitive with existing baselines for compression of segmentation models at low compression ratios and outperform them significantly at high compression ratios, yielding acceptable results even when removing more than $93\%$ of the parameters. In addition, \acosp is conceptually simple, easy to implement, and can readily be generalized to other data modalities, tasks, and architectures. Our code is available at \url{https://github.com/merantix/acosp}.
    BRIDGE: Byzantine-resilient Decentralized Gradient Descent. (arXiv:1908.08098v2 [stat.ML] UPDATED)
    (2 min) Machine learning has begun to play a central role in many applications. A multitude of these applications typically also involve datasets that are distributed across multiple computing devices/machines due to either design constraints (e.g., multiagent systems) or computational/privacy reasons (e.g., learning on smartphone data). Such applications often require the learning tasks to be carried out in a decentralized fashion, in which there is no central server that is directly connected to all nodes. In real-world decentralized settings, nodes are prone to undetected failures due to malfunctioning equipment, cyberattacks, etc., which are likely to crash non-robust learning algorithms. The focus of this paper is on robustification of decentralized learning in the presence of nodes that have undergone Byzantine failures. The Byzantine failure model allows faulty nodes to arbitrarily deviate from their intended behaviors, thereby ensuring designs of the most robust of algorithms. But the study of Byzantine resilience within decentralized learning, in contrast to distributed learning, is still in its infancy. In particular, existing Byzantine-resilient decentralized learning methods either do not scale well to large-scale machine learning models, or they lack statistical convergence guarantees that help characterize their generalization errors. In this paper, a scalable, Byzantine-resilient decentralized machine learning framework termed Byzantine-resilient decentralized gradient descent (BRIDGE) is introduced. Algorithmic and statistical convergence guarantees for one variant of BRIDGE are also provided in the paper for both strongly convex problems and a class of nonconvex problems. In addition, large-scale decentralized learning experiments are used to establish that the BRIDGE framework is scalable and it delivers competitive results for Byzantine-resilient convex and nonconvex learning.
    Combining Experimental and Observational Data for Identification of Long-Term Causal Effects. (arXiv:2201.10743v1 [stat.ME])
    (2 min) We consider the task of estimating the causal effect of a treatment variable on a long-term outcome variable using data from an observational domain and an experimental domain. The observational data is assumed to be confounded and hence without further assumptions, this dataset alone cannot be used for causal inference. Also, only a short-term version of the primary outcome variable of interest is observed in the experimental data, and hence, this dataset alone cannot be used for causal inference either. In a recent work, Athey et al. (2020) proposed a method for systematically combining such data for identifying the downstream causal effect in view. Their approach is based on the assumptions of internal and external validity of the experimental data, and an extra novel assumption called latent unconfoundedness. In this paper, we first review their proposed approach and discuss the latent unconfoundedness assumption. Then we propose two alternative approaches for data fusion for the purpose of estimating average treatment effect as well as the effect of treatment on the treated. Our first proposed approach is based on assuming equi-confounding bias for the short-term and long-term outcomes. Our second proposed approach is based on the proximal causal inference framework, in which we assume the existence of an extra variable in the system which is a proxy of the latent confounder of the treatment-outcome relation.
    Complexity of zigzag sampling algorithm for strongly log-concave distributions. (arXiv:2012.11094v2 [stat.ML] UPDATED)
    (2 min) We study the computational complexity of zigzag sampling algorithm for strongly log-concave distributions. The zigzag process has the advantage of not requiring time discretization for implementation, and that each proposed bouncing event requires only one evaluation of partial derivative of the potential, while its convergence rate is dimension independent. Using these properties, we prove that the zigzag sampling algorithm achieves $\varepsilon$ error in chi-square divergence with a computational cost equivalent to $O\bigl(\kappa^2 d^\frac{1}{2}(\log\frac{1}{\varepsilon})^{\frac{3}{2}}\bigr)$ gradient evaluations in the regime $\kappa \ll \frac{d}{\log d}$ under a warm start assumption, where $\kappa$ is the condition number and $d$ is the dimension.
    Uphill Roads to Variational Tightness: Monotonicity and Monte Carlo Objectives. (arXiv:2201.10989v1 [stat.ML])
    (2 min) We revisit the theory of importance weighted variational inference (IWVI), a promising strategy for learning latent variable models. IWVI uses new variational bounds, known as Monte Carlo objectives (MCOs), obtained by replacing intractable integrals by Monte Carlo estimates -- usually simply obtained via importance sampling. Burda, Grosse and Salakhutdinov (2016) showed that increasing the number of importance samples provably tightens the gap between the bound and the likelihood. Inspired by this simple monotonicity theorem, we present a series of nonasymptotic results that link properties of Monte Carlo estimates to tightness of MCOs. We challenge the rationale that smaller Monte Carlo variance leads to better bounds. We confirm theoretically the empirical findings of several recent papers by showing that, in a precise sense, negative correlation reduces the variational gap. We also generalise the original monotonicity theorem by considering non-uniform weights. We discuss several practical consequences of our theoretical results. Our work borrows many ideas and results from the theory of stochastic orders.
    Supervised learning of sheared distributions using linearized optimal transport. (arXiv:2201.10590v1 [math.ST])
    (2 min) In this paper we study supervised learning tasks on the space of probability measures. We approach this problem by embedding the space of probability measures into $L^2$ spaces using the optimal transport framework. In the embedding spaces, regular machine learning techniques are used to achieve linear separability. This idea has proved successful in applications and when the classes to be separated are generated by shifts and scalings of a fixed measure. This paper extends the class of elementary transformations suitable for the framework to families of shearings, describing conditions under which two classes of sheared distributions can be linearly separated. We furthermore give necessary bounds on the transformations to achieve a pre-specified separation level, and show how multiple embeddings can be used to allow for larger families of transformations. We demonstrate our results on image classification tasks.
    FIGARO: Generating Symbolic Music with Fine-Grained Artistic Control. (arXiv:2201.10936v1 [cs.SD])
    (2 min) Generating music with deep neural networks has been an area of active research in recent years. While the quality of generated samples has been steadily increasing, most methods are only able to exert minimal control over the generated sequence, if any. We propose the self-supervised \emph{description-to-sequence} task, which allows for fine-grained controllable generation on a global level by extracting high-level features about the target sequence and learning the conditional distribution of sequences given the corresponding high-level description in a sequence-to-sequence modelling setup. We train FIGARO (FIne-grained music Generation via Attention-based, RObust control) by applying \emph{description-to-sequence} modelling to symbolic music. By combining learned high level features with domain knowledge, which acts as a strong inductive bias, the model achieves state-of-the-art results in controllable symbolic music generation and generalizes well beyond the training distribution.
    Towards Sharp Stochastic Zeroth Order Hessian Estimators over Riemannian Manifolds. (arXiv:2201.10780v1 [stat.ML])
    (2 min) We study Hessian estimators for real-valued functions defined over an $n$-dimensional complete Riemannian manifold. We introduce new stochastic zeroth-order Hessian estimators using $O (1)$ function evaluations. We show that, for a smooth real-valued function $f$ with Lipschitz Hessian (with respect to the Rimannian metric), our estimator achieves a bias bound of order $ O \left( L_2 \delta + \gamma \delta^2 \right) $, where $ L_2 $ is the Lipschitz constant for the Hessian, $ \gamma $ depends on both the Levi-Civita connection and function $f$, and $\delta$ is the finite difference step size. To the best of our knowledge, our results provide the first bias bound for Hessian estimators that explicitly depends on the geometry of the underlying Riemannian manifold. Perhaps more importantly, our bias bound does not increase with dimension $n$. This improves best previously known bias bound for $O(1)$-evaluation Hessian estimators, which increases quadratically with $n$. We also study downstream computations based on our Hessian estimators. The supremacy of our method is evidenced by empirical evaluations.
  • Google AI Blog

    Controlling Neural Networks with Rule Representations
    (9 min) Posted by Sungyong Seo, Software Engineer and Sercan O. Arik, Research Scientist, Google Research, Cloud AI Team Deep neural networks (DNNs) provide more accurate results as the size and coverage of their training data increases. While investing in high-quality and large-scale labeled datasets is one path to model improvement, another is leveraging prior knowledge, concisely referred to as “rules” — reasoning heuristics, equations, associative logic, or constraints. Consider a common example from physics where a model is given the task of predicting the next state in a double pendulum system. While the model may learn to estimate the total energy of the system at a given point in time only from empirical data, it will frequently overestimate the energy unless also provided an equation th…
  • The Official NVIDIA Blog

    How Smart Hospital Technology Can Help Cut Down on Medical Errors
    (3 min) Despite the feats of modern medicine, as many as 250,000 Americans die from medical errors each year — more than 6 times the number killed in car accidents. Smart hospital AI can help avoid some of these fatalities in healthcare, just as computer vision-based driver assistance systems can improve road safety, according to AI leader Read article > The post How Smart Hospital Technology Can Help Cut Down on Medical Errors appeared first on The Official NVIDIA Blog.
  • AI Weirdness

    New AI paint colors
    (4 min) One of my first experiments with neural network text generation was to generate and name paint colors. I trained a neural net from scratch on lists of colors I could find online, and with no prior training on English or any language (and therefore no idea what paint colors were)
    Bonus: More paint colors
    (1 min) AI Weirdness: the strange side of machine learning
  • AWS Machine Learning Blog

    Run AutoML experiments with large parquet datasets using Amazon SageMaker Autopilot
    (3 min) Starting today, you can use Amazon SageMaker Autopilot to tackle regression and classification tasks on large datasets up to 100 GB. Additionally, you can now provide your datasets in either CSV or Apache Parquet content types. Businesses are generating more data than ever. A corresponding demand is growing for generating insights from these large datasets […]
    Use a web browser plugin to quickly translate text with Amazon Translate
    (5 min) Web browsers can be a single pane of glass for organizations to interact with their information—all of the tools can be viewed and accessed on one screen so that users don’t have to switch between applications and interfaces. For example, a customer call center might have several different applications to see customer reviews, social media […]
  • Becoming Human: Artificial Intelligence Magazine - Medium

    Basics of Image Recognition: A beginner’s approach
    (5 min) “Just as electricity transformed almost everything 100 years ago, today I even have a tough time thinking of an industry that i do not…
    Semi-Supervised Learning in Computer Vision
    (7 min) In this work, I am introducing Semi-supervised learning for Image recognition. This is just a gentle introduction, so I am not diving into…
    6 computer vision techniques that you should know
    (5 min) Squid game’s AI manages to kill the right people at red light/green light game. PSG’s AI automatically detects that Messi doesn’t run much… Continue reading on Becoming Human: Artificial Intelligence Magazine »
  • John D. Cook

    Mental math trick and deeper issues
    (3 min) I’d like to do two things in this post. I’ll present a fun bit of mental math, then use it to illustrate a few deeper points. The story starts with a Twitter thread I wrote on @AlgebraFact. The Strait of Gibraltar is 13 km wide. How wide is that in miles? You can convert km […] Mental math trick and deeper issues first appeared on John D. Cook.
  • Artificial Intelligence

    Artificial Nightmares : Sifu Splinter || DD PYTTI VQGAN AI Art Video [4K 60 FPS]
    (0 min) submitted by /u/Thenamessd [link] [comments]
    I wrote summaries for 76 papers for Casual GAN Papers last year. Here is my ranking of the best papers from 2021!
    (1 min) Hi everyone! There is an “X” of the year award in pretty much every industry ever, and ranking things is fun, which is reason enough for us to hold the first annual Casual GAN Papers Awards for the year 2021! This isn’t going to be a simple top-5 list, since pretty much all of the papers I covered this year are the cream of the crop in what they do, as judged by yours truly and my imaginary council of distinguished ML experts! The purpose of this post is simply to celebrate the amazing achievements in machine learning research over the last year and highlight some of the larger trends that I have noticed while analyzing the papers I read every week. https://www.casualganpapers.com/hiqh_quality_video_editing_stylegan_inversion/Stitch-It-In-Time-explained.html Subscribe to Casual GAN Papers and follow me on Twitter for weekly AI paper summaries! submitted by /u/KirillTheMunchKing [link] [comments]
    Our Lady Peace on Reuniting with Futurist Ray Kurzweil for a Sequel to Spiritual Machines
    (0 min) submitted by /u/Extreme_Homework7936 [link] [comments]
    separate vocals From other vocals in song (not vocals From instrumental)
    (1 min) so basically can anyone train an A.I that separates vocals from other vocals. Like literally you take a song and then put people talking on top of it really loud and train the ai to separate the song vocals and instrumental from the people talking on top of it really loud. the reason im asking is because theres some snippets of songs from documentaries but people are talking on top really loud so we cant make a good quality version. here is an example: https://youtu.be/LWODlrNujnU submitted by /u/Xiffo_ [link] [comments]
    Meta AI Introduces ‘Data2vec’ : A Self-Supervised Algorithm That Works For Speech, Computer Vision, and NLP
    (1 min) Many critical recent developments in AI have been enabled by self-supervised learning. Machines learn by directly observing the world rather than being explicitly instructed through labeled images, text, audio, and other data sources. However, whereas people appear to learn in similar ways regardless of how they obtain information — whether, through sight or hearing, self-supervised learning algorithms currently learn from images, speech, text, and other modalities in quite different ways. Continue Reading....... Paper: https://ai.facebook.com/research/data2vec-a-general-framework-for-self-supervised-learning-in-speech-vision-and-language Github: https://github.com/pytorch/fairseq/tree/main/examples/data2vec https://i.redd.it/02ps1fg0uge81.gif submitted by /u/ai-lover [link] [comments]
    Built an AI to beat the AI companies use to auto-reject job applications
    (2 min) submitted by /u/Camjw1123 [link] [comments]
    AI that generates pokemon
    (1 min) So I saw the articles about Max Woolf's AI that looked at a bunch of pokemon and made fakemon on it's own. I found a link where people can use it via using a quiz, but was wondering if anyone knows of a similar AI (or the AI itself) but can download it to PC and change all the pokemon images it uses to look at, and replace them with another collection of images from other games? I'm interested to try this thing with various other games :P For those who want to see the quiz that uses this AI, I will include it at the end. Thank you all for taking the time to read this and give advice: https://www.buzzfeed.com/daves4/pokemon-ai-quiiz submitted by /u/Aeromorpher [link] [comments]
    Meta AI Builds A Massive AI Research Supercomputer To Advance The Field Of Machine Learning And Data Science Research
    (1 min) In recent years, Meta has made a significant amount of contribution in the field of self-supervised learning and transformer-based models. Many of these AI models have revolutionized various domains such as computer vision, natural language processing, automatic speech recognition, etc. Adding improvements and training such models with a massive number of parameters for utilizing the vast amount of data that is made available daily will require new secure and reliable AI supercomputers capable of bringing down the training time by a critical factor. To address the need for massive computational power needed to train large AI models, Meta uses NVIDIA’s technology intending to develop advanced supercomputers capable of running quadrillions of operations per second. Meta and NVIDIA have built the AI Research SuperCluster (RSC), one of the fastest supercomputers, and is believed to become the fastest supercomputer during its completion in mid-2022. Researchers at Meta have already started to use the AI RSC to train their state-of-the-art AI models with well over trillions of parameters. Continue Reading submitted by /u/ai-lover [link] [comments]
    Artificial Nightmares : Pennywise the Dancing Clown || DD PYTTI VQGAN AI Art Video [4K 60 FPS]
    (0 min) submitted by /u/Thenamessd [link] [comments]
    Discovered DonkeyStereotype :)
    (0 min) Checkout this demo - https://share.streamlit.io/do-st/esvp/main/app.py submitted by /u/prithivida [link] [comments]
    Use of symbolic AI in gaming
    (0 min) If someone was trying to make a AI bot that will play a certain game, similar to League of Legends or Counter Strike, should it be made with symbolic computation, because deep learning or other statistic based AI would be hard to make since, since u would have to play the game in real time. submitted by /u/Both_Battle724 [link] [comments]
    People prefer to leave moral decisions to humans rather than algorithms that strictly follow human-created fairness principles, a study finds. The reason: people want to be able to break their own fairness principles whenever they want.
    (2 min) submitted by /u/LisaMck041 [link] [comments]
  • Machine Learning

    [R] ICLR 2022 Accepted Papers List
    (0 min) Just released: https://openreview.net/group?id=ICLR.cc/2022/Conference#spotlight-submissions What are some of the most interesting papers to check out? submitted by /u/iidealized [link] [comments]
    [D] Which lower-end RTX 3000 cards are suitable for some light model training?
    (1 min) I want to get a lower-end RTX 3000 card which I can also use to train some smaller models/debug code etc. I originally thought an RTX 3060 might be good but apparently it is (probably intentionally) not well supported when it comes to cuda things. And RTX 3050 will likely have the same issue and wouldn't be worth bothering with in any case. Are the 3060ti/3070/3070 ti cards any good? submitted by /u/AuspiciousApple [link] [comments]
    VICReg: Variance-Invariance-Covariance Regularization for Self-Supervised Learning
    (0 min) submitted by /u/deusexmachina2100 [link] [comments]
    [D] Percy Liang on ML Robustness, Foundation Models, and Reproducibility for AI experiments with CodaLab
    (1 min) Hi! Just want to share the latest episode of the gradient podcast, where I talked with Percy Liang, an Associate Professor of Computer Science at Stanford University and the director of the Center for Research on Foundation Models. Percy Liang’s research spans many topics in machine learning and natural language processing, including robustness, interpretability, semantics, and reasoning. We touch on the following topics / papers: Start in research Semantic Parsing Getting ideas for research directions from students Foundation Models On the Opportunities and Risks of Foundation Models Reflections on Foundation Models ML Robustness Removing spurious features can hurt accuracy and affect groups disproportionately. Selective classification can magnify disparities across groups Just train twice: improving group robustness without training group information LILA: language-informed latent actions CodaLab As always, welcome feedback! submitted by /u/regalalgorithm [link] [comments]
    [D] I wrote summaries for 76 papers for Casual GAN Papers last year. Here is my ranking of the best papers from 2021!
    (1 min) Hi everyone! There is an “X” of the year award in pretty much every industry ever, and ranking things is fun, which is reason enough for us to hold the first annual Casual GAN Papers Awards for the year 2021! This isn’t going to be a simple top-5 list, since pretty much all of the papers I covered this year are the cream of the crop in what they do, as judged by yours truly and my imaginary council of distinguished ML experts! The purpose of this post is simply to celebrate the amazing achievements in machine learning research over the last year and highlight some of the larger trends that I have noticed while analyzing the papers I read every week. https://www.casualganpapers.com/hiqh_quality_video_editing_stylegan_inversion/Stitch-It-In-Time-explained.html Subscribe to Casual GAN Papers and follow me on Twitter for weekly AI paper summaries! submitted by /u/KirillTheMunchKing [link] [comments]
    [N][R][P] moolib: A new library for distributed ML training with PyTorch from FAIR
    (0 min) Today, FAIR researchers announced on twitter a new library for distributed ML, called moolib. white paper: https://research.facebook.com/publications/moolib-a-platform-for-distributed-rl/ code: https://github.com/facebookresearch/moolib A discussion of this project by PyTorch lead, Soumith Chintala: https://twitter.com/soumithchintala/status/1487158053992550405 submitted by /u/egrefen [link] [comments]
    [R] Citation info for KNN
    (1 min) What do you cite for using a KNN model? Everything else I've used, from glove embeddings to freaking mutual information, I could fine the original publication with enough work. For KNN, the closest thing I have is this, which I have no clue how to cite. submitted by /u/bearson97 [link] [comments]
    [D],[R] What computer vision task do describing the features of images come under?
    (1 min) I have a very pressing question which I am not able to find the solution to anywhere- Is there a computer vision task category for the task when you describe the objective feature of the images like - brightness, focus, artefacts and hence classify the image in a certain category. It would come under classification with localisation if we were drawing the bounding box around the object but in this case, one can't physically locate brightness, focus and artefacts, as those are features of the images. Is there a vision task category under which this falls? submitted by /u/the_artist786 [link] [comments]
    [D] It seems OpenAI’s new embedding models perform terribly
    (5 min) Some people on Twitter have been investigating OpenAI’s new embedding API and it’s shocking how poorly it performs. On standard benchmarks, open source models 1000x smaller obtain equal or better performance! Models based on RoBERTa and T5, as well as the Sentence Transformer all achieve significantly better performance than the 175B model. Also of interest is that the DaVinci (175B) model is not clearly better than the Ada (350M) model. Has anyone tried adapting some other autoregressive languages models, such as GPT-2, GPT-Neo, or GPT-J to do embeddings? I’m quite curious if this is an inherent failing of autoregressive models or if there’s something else going on. Edit: a commenter has asked that I point out that I am one of the creators of GPT-Neo and part of the org that created GPT-J. These examples were not intended as specific endorsements, and I would be just as interested in comparisons using other billion-parameter+ autoregressive language models. Edit: I originally linked to a tweet about this, but several commenters pointed out that there’s also a blog post with more information. submitted by /u/StellaAthena [link] [comments]
    [P] My viking metal lyrics AI project has released its first song
    (3 min) I fine-tuned TorchRNN to viking metal lyrics and turned that into a web service (https://vikinglyrics.metalcoder.dev). Then I learned how to growl and made a song: https://www.youtube.com/watch?v=zOqY3KgdSLI It's going to be on Spotify next week. The next step of the project is underway. I fine-tuned a GPT2 model to get better results, and I'm making a new song using that. submitted by /u/osirisguitar [link] [comments]
    [D] How do you choose which Black-Box Explainability method to use?
    (1 min) There's several black box explainability methods available out there. Often these produce very/slightly different explanations for the exact same decision and the exact same model. (Saliency maps v/s Integrated Gradients v/s LRP etc) How do you quantify how good the quality of an explanation is? submitted by /u/mayhemonster [link] [comments]
  • Neural Networks, Deep Learning and Machine Learning

    Would a simple neural network work for multiple things or would I have to build a new one for each project?
    (1 min) I’m new to neural networks and I would like to know if this would work, thanks. submitted by /u/CalvinsCarvings [link] [comments]
  • Reinforcement Learning

    How would you answer questions like these?
    (1 min) While comparing a traditional rule-based method and an RL method --- "This is not a fair comparison, you do not know what the rule-based method's optimization objectives". While showing that the offline-RL method outperforms the behavioral agent's performance ---"How could a student outperforms his/her teacher?" submitted by /u/Blasphemer666 [link] [comments]
    How Amazon Music's recommender utilizes reinforcement learning
    (1 min) submitted by /u/ordinarytime [link] [comments]
    Environment for non-episodic continuing tasks
    (1 min) Hi! I am going to write my master's thesis about using PG /actor critic methods for process optimization. As a case study I think it would be fun to have some visual game-like enviroment to run the algorithms on, as a comparison. I was therefore looking everywhere to find environment simulations with continuous action spaces that are non-episodic, to on-line learn how to optimize the policies. I thought the Pendulum environment was quite fitting, or the mountaincar-continuous. But these all terminate and restart after certain number of steps. Do you guys know of anything, or have any ideas? It would be even better if the environment was (slowly) time-varying. submitted by /u/Acrobatic-Ad-9189 [link] [comments]
    Is DQN truly off-policy?
    (4 min) DQN uses as an exploration policy the ε-greedy behaviour over the network's predicted Q-values. So in effect, it partially uses the learnt policy to explore the environment. It seems to me that the definition of off-policy is not the same for everyone. In particular, I often see two different definitions: A: An off-policy method uses a different policy for exploration than the policy that is learnt. B: An off-policy method uses an independent policy for exploration from the policy that is learnt. ​ Clearly, DQN's exploration policy is different but not independent from the target policy. So I would be eager to say that the off vs on policy distinction is not a binary one, but it is rather a spectrum1. Nonetheless, I understand that DQN can be trained entirely off-policy by simply using an experience replay collected by any policy (that has explored the MDP sufficiently) and minimising the TD error in that. But isn't the main point of RL to make agents that explore environments efficiently? ​ 1: In fact, for the case of DQN, the difference can be quantifiable. The probability for the exploration policy to select a different action from the target policy is exactly ε. I am braindumping here, but maybe that opens up a research direction? Perhaps by using something like the KL-divergence for measuring the difference between exploration and target policies (for stochastic ones at least)? submitted by /u/SomeParanoidAndroid [link] [comments]
    "Surprisingly Robust In-Hand Manipulation: An Empirical Study", Bhatt et al 2022 (hand-designed primitives for inflatable hand: learning-free, open loop, but still reliably manipulate cubes)
    (1 min) submitted by /u/gwern [link] [comments]
    How do I self learn reinforcement learning with no background in CS?
    (2 min) I’m trying to jump into MDPS and POMDPS and holy shit I might give up. I don’t understand anything and I read so much on it yet still don’t understand shit. Am I just too stupid for this field? Please give me some motivation and advice!! submitted by /u/lilceezee [link] [comments]
    What are some good resources to get started in reinforcement learning? Books, videos, etc.
    (1 min) submitted by /u/OpportunityBrave3183 [link] [comments]
    Find someone for academic communication online (RL direction)
    (1 min) Hello, I'm pursuing my Ph.D. degree in Beihang University, China. My research interests include reinforcement learning, machine learning, and their applications to combinatorial optimization problems in cloud manufacturing. Because of the COVID-19, I lose the chance of going abroad for academic communication, so I really want to find someone overseas to communicate or collaborate or just chat (to practice my poor English). I don't know whether the software wechat is available for overseas friends, please comment if you have similar research interests and want academic communication. submitted by /u/rl4co [link] [comments]

2022-01-27

  • cs.LG updates on arXiv.org

    Online Active Learning with Dynamic Marginal Gain Thresholding. (arXiv:2201.10547v1 [cs.LG])
    (2 min) The blessing of ubiquitous data also comes with a curse: the communication, storage, and labeling of massive, mostly redundant datasets. In our work, we seek to solve the problem at its source, collecting only valuable data and throwing out the rest, via active learning. We propose an online algorithm which, given any stream of data, any assessment of its value, and any formulation of its selection cost, extracts the most valuable subset of the stream up to a constant factor while using minimal memory. Notably, our analysis also holds for the federated setting, in which multiple agents select online from individual data streams without coordination and with potentially very different appraisals of cost. One particularly important use case is selecting and labeling training sets from unlabeled collections of data that maximize the test-time performance of a given classifier. In prediction tasks on ImageNet and MNIST, we show that our selection method outperforms random selection by up to 5-20%.
    Revisiting and Advancing Fast Adversarial Training Through The Lens of Bi-Level Optimization. (arXiv:2112.12376v3 [cs.LG] UPDATED)
    (2 min) Adversarial training (AT) is a widely recognized defense mechanism to gain the robustness of deep neural networks against adversarial attacks. It is built on min-max optimization (MMO), where the minimizer (i.e., defender) seeks a robust model to minimize the worst-case training loss in the presence of adversarial examples crafted by the maximizer (i.e., attacker). However, the conventional MMO method makes AT hard to scale. Thus, Fast-AT and other recent algorithms attempt to simplify MMO by replacing its maximization step with the single gradient sign-based attack generation step. Although easy to implement, FAST-AT lacks theoretical guarantees, and its empirical performance is unsatisfactory due to the issue of robust catastrophic overfitting when training with strong adversaries. In this paper, we advance Fast-AT from the fresh perspective of bi-level optimization (BLO). We first show that the commonly-used Fast-AT is equivalent to using a stochastic gradient algorithm to solve a linearized BLO problem involving a sign operation. However, the discrete nature of the sign operation makes it difficult to understand the algorithm performance. Inspired by BLO, we design and analyze a new set of robust training algorithms termed Fast Bi-level AT (Fast-BAT), which effectively defends sign-based projected gradient descent (PGD) attacks without using any gradient sign method or explicit robust regularization. In practice, we show that our method yields substantial robustness improvements over multiple baselines across multiple models and datasets.
    IMO$^3$: Interactive Multi-Objective Off-Policy Optimization. (arXiv:2201.09798v2 [cs.LG] UPDATED)
    (2 min) Most real-world optimization problems have multiple objectives. A system designer needs to find a policy that trades off these objectives to reach a desired operating point. This problem has been studied extensively in the setting of known objective functions. We consider a more practical but challenging setting of unknown objective functions. In industry, this problem is mostly approached with online A/B testing, which is often costly and inefficient. As an alternative, we propose interactive multi-objective off-policy optimization (IMO$^3$). The key idea in our approach is to interact with a system designer using policies evaluated in an off-policy fashion to uncover which policy maximizes her unknown utility function. We theoretically show that IMO$^3$ identifies a near-optimal policy with high probability, depending on the amount of feedback from the designer and training data for off-policy estimation. We demonstrate its effectiveness empirically on multiple multi-objective optimization problems.
    A deep mixture density network for outlier-corrected interpolation of crowd-sourced weather data. (arXiv:2201.10544v1 [stat.ML])
    (2 min) As the costs of sensors and associated IT infrastructure decreases - as exemplified by the Internet of Things - increasing volumes of observational data are becoming available for use by environmental scientists. However, as the number of available observation sites increases, so too does the opportunity for data quality issues to emerge, particularly given that many of these sensors do not have the benefit of official maintenance teams. To realise the value of crowd sourced 'Internet of Things' type observations for environmental modelling, we require approaches that can automate the detection of outliers during the data modelling process so that they do not contaminate the true distribution of the phenomena of interest. To this end, here we present a Bayesian deep learning approach for spatio-temporal modelling of environmental variables with automatic outlier detection. Our approach implements a Gaussian-uniform mixture density network whose dual purposes - modelling the phenomenon of interest, and learning to classify and ignore outliers - are achieved simultaneously, each by specifically designed branches of our neural network. For our example application, we use the Met Office's Weather Observation Website data, an archive of observations from around 1900 privately run and unofficial weather stations across the British Isles. Using data on surface air temperature, we demonstrate how our deep mixture model approach enables the modelling of a highly skilled spatio-temporal temperature distribution without contamination from spurious observations. We hope that adoption of our approach will help unlock the potential of incorporating a wider range of observation sources, including from crowd sourcing, into future environmental models.
    Explainability in Music Recommender Systems. (arXiv:2201.10528v1 [cs.LG])
    (2 min) The most common way to listen to recorded music nowadays is via streaming platforms which provide access to tens of millions of tracks. To assist users in effectively browsing these large catalogs, the integration of Music Recommender Systems (MRSs) has become essential. Current real-world MRSs are often quite complex and optimized for recommendation accuracy. They combine several building blocks based on collaborative filtering and content-based recommendation. This complexity can hinder the ability to explain recommendations to end users, which is particularly important for recommendations perceived as unexpected or inappropriate. While pure recommendation performance often correlates with user satisfaction, explainability has a positive impact on other factors such as trust and forgiveness, which are ultimately essential to maintain user loyalty. In this article, we discuss how explainability can be addressed in the context of MRSs. We provide perspectives on how explainability could improve music recommendation algorithms and enhance user experience. First, we review common dimensions and goals of recommenders' explainability and in general of eXplainable Artificial Intelligence (XAI), and elaborate on the extent to which these apply -- or need to be adapted -- to the specific characteristics of music consumption and recommendation. Then, we show how explainability components can be integrated within a MRS and in what form explanations can be provided. Since the evaluation of explanation quality is decoupled from pure accuracy-based evaluation criteria, we also discuss requirements and strategies for evaluating explanations of music recommendations. Finally, we describe the current challenges for introducing explainability within a large-scale industrial music recommender system and provide research perspectives.
    Convex Analysis of the Mean Field Langevin Dynamics. (arXiv:2201.10469v1 [stat.ML])
    (2 min) As an example of the nonlinear Fokker-Planck equation, the mean field Langevin dynamics attracts attention due to its connection to (noisy) gradient descent on infinitely wide neural networks in the mean field regime, and hence the convergence property of the dynamics is of great theoretical interest. In this work, we give a simple and self-contained convergence rate analysis of the mean field Langevin dynamics with respect to the (regularized) objective function in both continuous and discrete time settings. The key ingredient of our proof is a proximal Gibbs distribution $p_q$ associated with the dynamics, which, in combination of techniques in [Vempala and Wibisono (2019)], allows us to develop a convergence theory parallel to classical results in convex optimization. Furthermore, we reveal that $p_q$ connects to the duality gap in the empirical risk minimization setting, which enables efficient empirical evaluation of the algorithm convergence.
    Primer: Searching for Efficient Transformers for Language Modeling. (arXiv:2109.08668v2 [cs.LG] UPDATED)
    (2 min) Large Transformer models have been central to recent advances in natural language processing. The training and inference costs of these models, however, have grown rapidly and become prohibitively expensive. Here we aim to reduce the costs of Transformers by searching for a more efficient variant. Compared to previous approaches, our search is performed at a lower level, over the primitives that define a Transformer TensorFlow program. We identify an architecture, named Primer, that has a smaller training cost than the original Transformer and other variants for auto-regressive language modeling. Primer's improvements can be mostly attributed to two simple modifications: squaring ReLU activations and adding a depthwise convolution layer after each Q, K, and V projection in self-attention. Experiments show Primer's gains over Transformer increase as compute scale grows and follow a power law with respect to quality at optimal model sizes. We also verify empirically that Primer can be dropped into different codebases to significantly speed up training without additional tuning. For example, at a 500M parameter size, Primer improves the original T5 architecture on C4 auto-regressive language modeling, reducing the training cost by 4X. Furthermore, the reduced training cost means Primer needs much less compute to reach a target one-shot performance. For instance, in a 1.9B parameter configuration similar to GPT-3 XL, Primer uses 1/3 of the training compute to achieve the same one-shot performance as Transformer. We open source our models and several comparisons in T5 to help with reproducibility.
    A Survey on Machine Learning-based Misbehavior Detection Systems for 5G and Beyond Vehicular Networks. (arXiv:2201.10500v1 [cs.CR])
    (2 min) Significant progress has been made towards deploying Vehicle-to-Everything (V2X) technology. Integrating V2X with 5G has enabled ultra-low latency and high-reliability V2X communications. However, while communication performance has enhanced, security and privacy issues have increased. Attacks have become more aggressive, and attackers have become more strategic. Public Key Infrastructure proposed by standardization bodies cannot solely defend against these attacks. Thus, in complementary of that, sophisticated systems should be designed to detect such attacks and attackers. Machine Learning (ML) has recently emerged as a key enabler to secure our future roads. Many V2X Misbehavior Detection Systems (MDSs) have adopted this paradigm. Yet, analyzing these systems is a research gap, and developing effective ML-based MDSs is still an open issue. To this end, this paper present a comprehensive survey and classification of ML-based MDSs. We analyze and discuss them from both security and ML perspectives. Then, we give some learned lessons and recommendations helping in developing, validating, and deploying ML-based MDSs. Finally, we highlight open research and standardization issues with some future directions.
    Learning Bi-typed Multi-relational Heterogeneous Graph via Dual Hierarchical Attention Networks. (arXiv:2112.13078v2 [cs.LG] UPDATED)
    (2 min) Bi-type multi-relational heterogeneous graph (BMHG) is one of the most common graphs in practice, for example, academic networks, e-commerce user behavior graph and enterprise knowledge graph. It is a critical and challenge problem on how to learn the numerical representation for each node to characterize subtle structures. However, most previous studies treat all node relations in BMHG as the same class of relation without distinguishing the different characteristics between the intra-class relations and inter-class relations of the bi-typed nodes, causing the loss of significant structure information. To address this issue, we propose a novel Dual Hierarchical Attention Networks (DHAN) based on the bi-typed multi-relational heterogeneous graphs to learn comprehensive node representations with the intra-class and inter-class attention-based encoder under a hierarchical mechanism. Specifically, the former encoder aggregates information from the same type of nodes, while the latter aggregates node representations from its different types of neighbors. Moreover, to sufficiently model node multi-relational information in BMHG, we adopt a newly proposed hierarchical mechanism. By doing so, the proposed dual hierarchical attention operations enable our model to fully capture the complex structures of the bi-typed multi-relational heterogeneous graphs. Experimental results on various tasks against the state-of-the-arts sufficiently confirm the capability of DHAN in learning node representations on the BMHGs.
    Spatial Graph Attention and Curiosity-driven Policy for Antiviral Drug Discovery. (arXiv:2106.02190v4 [cs.LG] UPDATED)
    (2 min) We developed Distilled Graph Attention Policy Network (DGAPN), a reinforcement learning model to generate novel graph-structured chemical representations that optimize user-defined objectives by efficiently navigating a physically constrained domain. The framework is examined on the task of generating molecules that are designed to bind, noncovalently, to functional sites of SARS-CoV-2 proteins. We present a spatial Graph Attention (sGAT) mechanism that leverages self-attention over both node and edge attributes as well as encoding the spatial structure -- this capability is of considerable interest in synthetic biology and drug discovery. An attentional policy network is introduced to learn the decision rules for a dynamic, fragment-based chemical environment, and state-of-the-art policy gradient techniques are employed to train the network with stability. Exploration is driven by the stochasticity of the action space design and the innovation reward bonuses learned and proposed by random network distillation. In experiments, our framework achieved outstanding results compared to state-of-the-art algorithms, while reducing the complexity of paths to chemical synthesis.
    Self-Supervised Inference in State-Space Models. (arXiv:2107.13349v3 [cs.LG] UPDATED)
    (2 min) We perform approximate inference in state-space models with nonlinear state transitions. Without parameterizing a generative model, we apply Bayesian update formulas using a local linearity approximation parameterized by neural networks. This comes accompanied by a maximum likelihood objective that requires no supervision via uncorrupt observations or ground truth latent states. The optimization backpropagates through a recursion similar to the classical Kalman filter and smoother. Additionally, using an approximate conditional independence, we can perform smoothing without having to parameterize a separate model. In scientific applications, domain knowledge can give a linear approximation of the latent transition maps, which we can easily incorporate into our model. Usage of such domain knowledge is reflected in excellent results (despite our model's simplicity) on the chaotic Lorenz system compared to fully supervised and variational inference methods. Finally, we show competitive results on an audio denoising experiment.
    Inductive Conformal Recommender System. (arXiv:2109.08949v2 [cs.LG] UPDATED)
    (2 min) Traditional recommendation algorithms develop techniques that can help people to choose desirable items. However, in many real-world applications, along with a set of recommendations, it is also essential to quantify each recommendation's (un)certainty. The conformal recommender system uses the experience of a user to output a set of recommendations, each associated with a precise confidence value. Given a significance level $\varepsilon$, it provides a bound $\varepsilon$ on the probability of making a wrong recommendation. The conformal framework uses a key concept called \emph{nonconformity measure} that measures the strangeness of an item concerning other items. One of the significant design challenges of any conformal recommendation framework is integrating nonconformity measures with the recommendation algorithm. This paper introduces an inductive variant of a conformal recommender system. We propose and analyze different nonconformity measures in the inductive setting. We also provide theoretical proofs on the error-bound and the time complexity. Extensive empirical analysis on ten benchmark datasets demonstrates that the inductive variant substantially improves the performance in computation time while preserving the accuracy.
    Training a Resilient Q-Network against Observational Interference. (arXiv:2102.09677v3 [cs.LG] UPDATED)
    (2 min) Deep reinforcement learning (DRL) has demonstrated impressive performance in various gaming simulators and real-world applications. In practice, however, a DRL agent may receive faulty observation by abrupt interferences such as black-out, frozen-screen, and adversarial perturbation. How to design a resilient DRL algorithm against these rare but mission-critical and safety-crucial scenarios is an essential yet challenging task. In this paper, we consider a deep q-network (DQN) framework training with an auxiliary task of observational interferences such as artificial noises. Inspired by causal inference for observational interference, we propose a causal inference based DQN algorithm called causal inference Q-network (CIQ). We evaluate the performance of CIQ in several benchmark DQN environments with different types of interferences as auxiliary labels. Our experimental results show that the proposed CIQ method could achieve higher performance and more resilience against observational interferences.
    BERT Attends the Conversation: Improving Low-Resource Conversational ASR. (arXiv:2110.02267v2 [cs.CL] UPDATED)
    (2 min) We propose new, data-efficient training tasks for BERT models that improve performance of automatic speech recognition (ASR) systems on conversational speech. We include past conversational context and fine-tune BERT on transcript disambiguation without external data to rescore ASR candidates. Our results show word error rate recoveries up to 37.2%. We test our methods in low-resource data domains, both in language (Norwegian), tone (spontaneous, conversational), and topics (parliament proceedings and customer service phone calls). These techniques are applicable to any ASR system and do not require any additional data, provided a pre-trained BERT model. We also show how the performance of our context-augmented rescoring methods strongly depends on the degree of spontaneity and nature of the conversation.
    Optimisation of Structured Neural Controller Based on Continuous-Time Policy Gradient. (arXiv:2201.06262v2 [cs.LG] UPDATED)
    (2 min) This study presents a policy optimisation framework for structured nonlinear control of continuous-time (deterministic) dynamic systems. The proposed approach prescribes a structure for the controller based on relevant scientific knowledge (such as Lyapunov stability theory or domain experiences) while considering the tunable elements inside the given structure as the point of parametrisation with neural networks. To optimise a cost represented as a function of the neural network weights, the proposed approach utilises the continuous-time policy gradient method based on adjoint sensitivity analysis as a means for correct and performant computation of cost gradient. This enables combining the stability, robustness, and physical interpretability of an analytically-derived structure for the feedback controller with the representational flexibility and optimised resulting performance provided by machine learning techniques. Such a hybrid paradigm for fixed-structure control synthesis is particularly useful for optimising adaptive nonlinear controllers to achieve improved performance in online operation, an area where the existing theory prevails the design of structure while lacking clear analytical understandings about tuning of the gains and the uncertainty model basis functions that govern the performance characteristics. Numerical experiments on aerospace applications illustrate the utility of the structured nonlinear controller optimisation framework.
    Automatic tumour segmentation in H&E-stained whole-slide images of the pancreas. (arXiv:2112.01533v2 [eess.IV] UPDATED)
    (2 min) Pancreatic cancer will soon be the second leading cause of cancer-related death in Western society. Imaging techniques such as CT, MRI and ultrasound typically help providing the initial diagnosis, but histopathological assessment is still the gold standard for final confirmation of disease presence and prognosis. In recent years machine learning approaches and pathomics pipelines have shown potential in improving diagnostics and prognostics in other cancerous entities, such as breast and prostate cancer. A crucial first step in these pipelines is typically identification and segmentation of the tumour area. Ideally this step is done automatically to prevent time consuming manual annotation. We propose a multi-task convolutional neural network to balance disease detection and segmentation accuracy. We validated our approach on a dataset of 29 patients (for a total of 58 slides) at different resolutions. The best single task segmentation network achieved a median Dice of 0.885 (0.122) IQR at a resolution of 15.56 $\mu$m. Our multi-task network improved on that with a median Dice score of 0.934 (0.077) IQR.
    Symmetry Perception by Deep Networks: Inadequacy of Feed-Forward Architectures and Improvements with Recurrent Connections. (arXiv:2112.04162v2 [cs.CV] CROSS LISTED)
    (2 min) Symmetry is omnipresent in nature and perceived by the visual system of many species, as it facilitates detecting ecologically important classes of objects in our environment. Symmetry perception requires abstraction of long-range spatial dependencies between image regions, and its underlying neural mechanisms remain elusive. In this paper, we evaluate Deep Neural Network (DNN) architectures on the task of learning symmetry perception from examples. We demonstrate that feed-forward DNNs that excel at modelling human performance on object recognition tasks, are unable to acquire a general notion of symmetry. This is the case even when the DNNs are architected to capture long-range spatial dependencies, such as through `dilated' convolutions and the recently introduced `transformers' design. By contrast, we find that recurrent architectures are capable of learning to perceive symmetry by decomposing the long-range spatial dependencies into a sequence of local operations, that are reusable for novel images. These results suggest that recurrent connections likely play an important role in symmetry perception in artificial systems, and possibly, biological ones too.
    CAE-Transformer: Transformer-based Model to Predict Invasiveness of Lung Adenocarcinoma Subsolid Nodules from Non-thin Section 3D CT Scans. (arXiv:2110.08721v3 [eess.IV] UPDATED)
    (2 min) Lung cancer is the leading cause of mortality from cancer worldwide and has various histologic types, among which Lung Adenocarcinoma (LUAC) has recently been the most prevalent one. The current approach to determine the invasiveness of LUACs is surgical resection, which is not a viable solution to fight lung cancer in a timely fashion. An alternative approach is to analyze chest Computed Tomography (CT) scans. The radiologists' analysis based on CT images, however, is subjective and might result in a low accuracy. In this paper, a transformer-based framework, referred to as the "CAE-Transformer", is developed to efficiently classify LUACs using whole CT images instead of finely annotated nodules. The proposed CAE-Transformer can achieve high accuracy over a small dataset and requires minor supervision from radiologists. The CAE Transformer utilizes an encoder to automatically extract informative features from CT slices, which are then fed to a modified transformer to capture global inter-slice relations and provide classification labels. Experimental results on our in-house dataset of 114 pathologically proven Sub-Solid Nodules (SSNs) demonstrate the superiority of the CAE-Transformer over its counterparts, achieving an accuracy of 87.73%, sensitivity of 88.67%, specificity of 86.33%, and AUC of 0.913, using a 10-fold cross-validation.
    Input correlations impede suppression of chaos and learning in balanced rate networks. (arXiv:2201.09916v1 [q-bio.NC])
    (2 min) Neural circuits exhibit complex activity patterns, both spontaneously and evoked by external stimuli. Information encoding and learning in neural circuits depend on how well time-varying stimuli can control spontaneous network activity. We show that in firing-rate networks in the balanced state, external control of recurrent dynamics, i.e., the suppression of internally-generated chaotic variability, strongly depends on correlations in the input. A unique feature of balanced networks is that, because common external input is dynamically canceled by recurrent feedback, it is far easier to suppress chaos with independent inputs into each neuron than through common input. To study this phenomenon we develop a non-stationary dynamic mean-field theory that determines how the activity statistics and largest Lyapunov exponent depend on frequency and amplitude of the input, recurrent coupling strength, and network size, for both common and independent input. We also show that uncorrelated inputs facilitate learning in balanced networks.
    Recommendation Unlearning. (arXiv:2201.06820v2 [cs.IR] UPDATED)
    (2 min) Recommender systems provide essential web services by learning users' personal preferences from collected data. However, in many cases, systems also need to forget some training data. From the perspective of privacy, several privacy regulations have recently been proposed, requiring systems to eliminate any impact of the data whose owner requests to forget. From the perspective of utility, if a system's utility is damaged by some bad data, the system needs to forget these data to regain utility. From the perspective of usability, users can delete noise and incorrect entries so that a system can provide more useful recommendations. While unlearning is very important, it has not been well-considered in existing recommender systems. Although there are some researches have studied the problem of machine unlearning in the domains of image and text data, existing methods can not been directly applied to recommendation as they are unable to consider the collaborative information. In this paper, we propose RecEraser, a general and efficient machine unlearning framework tailored to recommendation task. The main idea of RecEraser is to partition the training set into multiple shards and train a constituent model for each shard. Specifically, to keep the collaborative information of the data, we first design three novel data partition algorithms to divide training data into balanced groups based on their similarity. Then, considering that different shard models do not uniformly contribute to the final prediction, we further propose an adaptive aggregation method to improve the global model utility. Experimental results on three public benchmarks show that RecEraser can not only achieve efficient unlearning, but also outperform the state-of-the-art unlearning methods in terms of model utility. The source code can be found at https://github.com/chenchongthu/Recommendation-Unlearning
    System-Agnostic Meta-Learning for MDP-based Dynamic Scheduling via Descriptive Policy. (arXiv:2201.07051v2 [cs.LG] UPDATED)
    (2 min) Dynamic scheduling is an important problem in applications from queuing to wireless networks. It addresses how to choose an item among multiple scheduling items in each timestep to achieve a long-term goal. Conventional approaches for dynamic scheduling find the optimal policy for a given specific system so that the policy from these approaches is usable only for the corresponding system characteristics. Hence, it is hard to use such approaches for a practical system in which system characteristics dynamically change. This paper proposes a novel policy structure for MDP-based dynamic scheduling, a descriptive policy, which has a system-agnostic capability to adapt to unseen system characteristics for an identical task (dynamic scheduling). To this end, the descriptive policy learns a system-agnostic scheduling principle--in a nutshell, "which condition of items should have a higher priority in scheduling". The scheduling principle can be applied to any system so that the descriptive policy learned in one system can be used for another system. Experiments with simple explanatory and realistic application scenarios demonstrate that it enables system-agnostic meta-learning with very little performance degradation compared with the system-specific conventional policies.
    Conditional entropy minimization principle for learning domain invariant representation features. (arXiv:2201.10460v1 [cs.LG])
    (2 min) Invariance principle-based methods, for example, Invariant Risk Minimization (IRM), have recently emerged as promising approaches for Domain Generalization (DG). Despite the promising theory, invariance principle-based approaches fail in common classification tasks due to the mixture of the true invariant features and the spurious invariant features. In this paper, we propose a framework based on the conditional entropy minimization principle to filter out the spurious invariant features leading to a new algorithm with a better generalization capability. We theoretically prove that under some particular assumptions, the representation function can precisely recover the true invariant features. In addition, we also show that the proposed approach is closely related to the well-known Information Bottleneck framework. Both the theoretical and numerical results are provided to justify our approach.
    Post-Hoc Explanations Fail to Achieve their Purpose in Adversarial Contexts. (arXiv:2201.10295v1 [cs.LG])
    (0 min) Existing and planned legislation stipulates various obligations to provide information about machine learning algorithms and their functioning, often interpreted as obligations to "explain". Many researchers suggest using post-hoc explanation algorithms for this purpose. In this paper, we combine legal, philosophical and technical arguments to show that post-hoc explanation algorithms are unsuitable to achieve the law's objectives. Indeed, most situations where explanations are requested are adversarial, meaning that the explanation provider and receiver have opposing interests and incentives, so that the provider might manipulate the explanation for her own ends. We show that this fundamental conflict cannot be resolved because of the high degree of ambiguity of post-hoc explanations in realistic application scenarios. As a consequence, post-hoc explanation algorithms are unsuitable to achieve the transparency objectives inherent to the legal norms. Instead, there is a need to more explicitly discuss the objectives underlying "explainability" obligations as these can often be better achieved through other mechanisms. There is an urgent need for a more open and honest discussion regarding the potential and limitations of post-hoc explanations in adversarial contexts, in particular in light of the current negotiations about the European Union's draft Artificial Intelligence Act.
    AggMatch: Aggregating Pseudo Labels for Semi-Supervised Learning. (arXiv:2201.10444v1 [cs.LG])
    (0 min) Semi-supervised learning (SSL) has recently proven to be an effective paradigm for leveraging a huge amount of unlabeled data while mitigating the reliance on large labeled data. Conventional methods focused on extracting a pseudo label from individual unlabeled data sample and thus they mostly struggled to handle inaccurate or noisy pseudo labels, which degenerate performance. In this paper, we address this limitation with a novel SSL framework for aggregating pseudo labels, called AggMatch, which refines initial pseudo labels by using different confident instances. Specifically, we introduce an aggregation module for consistency regularization framework that aggregates the initial pseudo labels based on the similarity between the instances. To enlarge the aggregation candidates beyond the mini-batch, we present a class-balanced confidence-aware queue built with the momentum model, encouraging to provide more stable and consistent aggregation. We also propose a novel uncertainty-based confidence measure for the pseudo label by considering the consensus among multiple hypotheses with different subsets of the queue. We conduct experiments to demonstrate the effectiveness of AggMatch over the latest methods on standard benchmarks and provide extensive analyses.
    Heat Conduction Plate Layout Optimization using Physics-driven Convolutional Neural Networks. (arXiv:2201.10002v1 [cs.LG])
    (0 min) The layout optimization of the heat conduction is essential during design in engineering, especially for thermal sensible products. When the optimization algorithm iteratively evaluates different loading cases, the traditional numerical simulation methods used usually lead to a substantial computational cost. To effectively reduce the computational effort, data-driven approaches are used to train a surrogate model as a mapping between the prescribed external loads and various geometry. However, the existing model are trained by data-driven methods which requires intensive training samples that from numerical simulations and not really effectively solve the problem. Choosing the steady heat conduction problems as examples, this paper proposes a Physics-driven Convolutional Neural Networks (PD-CNN) method to infer the physical field solutions for random varied loading cases. After that, the Particle Swarm Optimization (PSO) algorithm is used to optimize the sizes and the positions of the hole masks in the prescribed design domain, and the average temperature value of the entire heat conduction field is minimized, and the goal of minimizing heat transfer is achieved. Compared with the existing data-driven approaches, the proposed PD-CNN optimization framework not only predict field solutions that are highly consistent with conventional simulation results, but also generate the solution space with without any pre-obtained training data.
    Numerical Approximation of Partial Differential Equations by a Variable Projection Method with Artificial Neural Networks. (arXiv:2201.09989v1 [math.NA])
    (0 min) We present a method for solving linear and nonlinear PDEs based on the variable projection (VarPro) framework and artificial neural networks (ANN). For linear PDEs, enforcing the boundary/initial value problem on the collocation points leads to a separable nonlinear least squares problem about the network coefficients. We reformulate this problem by the VarPro approach to eliminate the linear output-layer coefficients, leading to a reduced problem about the hidden-layer coefficients only. The reduced problem is solved first by the nonlinear least squares method to determine the hidden-layer coefficients, and then the output-layer coefficients are computed by the linear least squares method. For nonlinear PDEs, enforcing the boundary/initial value problem on the collocation points leads to a nonlinear least squares problem that is not separable, which precludes the VarPro strategy for such problems. To enable the VarPro approach for nonlinear PDEs, we first linearize the problem with a Newton iteration, using a particular form of linearization. The linearized system is solved by the VarPro framework together with ANNs. Upon convergence of the Newton iteration, the network coefficients provide the representation of the solution field to the original nonlinear problem. We present ample numerical examples with linear and nonlinear PDEs to demonstrate the performance of the method herein. For smooth field solutions, the errors of the current method decrease exponentially as the number of collocation points or the number of output-layer coefficients increases. We compare the current method with the ELM method from a previous work. Under identical conditions and network configurations, the current method exhibits an accuracy significantly superior to the ELM method.
    Dynamics-Aware Comparison of Learned Reward Functions. (arXiv:2201.10081v1 [cs.LG])
    (0 min) The ability to learn reward functions plays an important role in enabling the deployment of intelligent agents in the real world. However, comparing reward functions, for example as a means of evaluating reward learning methods, presents a challenge. Reward functions are typically compared by considering the behavior of optimized policies, but this approach conflates deficiencies in the reward function with those of the policy search algorithm used to optimize it. To address this challenge, Gleave et al. (2020) propose the Equivalent-Policy Invariant Comparison (EPIC) distance. EPIC avoids policy optimization, but in doing so requires computing reward values at transitions that may be impossible under the system dynamics. This is problematic for learned reward functions because it entails evaluating them outside of their training distribution, resulting in inaccurate reward values that we show can render EPIC ineffective at comparing rewards. To address this problem, we propose the Dynamics-Aware Reward Distance (DARD), a new reward pseudometric. DARD uses an approximate transition model of the environment to transform reward functions into a form that allows for comparisons that are invariant to reward shaping while only evaluating reward functions on transitions close to their training distribution. Experiments in simulated physical domains demonstrate that DARD enables reliable reward comparisons without policy optimization and is significantly more predictive than baseline methods of downstream policy performance when dealing with learned reward functions.
    Analysis of various climate change parameters in India using machine learning. (arXiv:2201.10123v1 [stat.AP])
    (0 min) Climate change in India is one of the most alarming problems faced by our community. Due to adverse and sudden changes in climate in past few years, mankind is at threat. Various impacts of climate change include extreme heat, changing rainfall patterns, droughts, groundwater, glacier melt, sea-level rise, and many more. Machine Learning can be used to analyze and predict the graph of change using previous data and thus design a model which in the future can furthermore be used to catalyze impactful work of climate change and take steps in the direction to help India fight against the upcoming climate changes. In this paper, we have analyzed 17 climate change parameters about India. We have applied linear regression, exponential regression, and polynomial regression to the parameters and evaluated the results. Using the designed model, we will predict these parameters for the years 2025,2030, 2035. These predicted values will thus help our community to prevent and take actions against the adverse and hazardous effects on mankind. We have designed and created this model which provides accurate results regarding all 17 parameters. The predicted values will therefore help India to be well equipped against climate change. This data when made available to the people of India will help create awareness among them and will help us save our country from the haphazard effects of climate change.
    ShapeFormer: Transformer-based Shape Completion via Sparse Representation. (arXiv:2201.10326v1 [cs.CV])
    (0 min) We present ShapeFormer, a transformer-based network that produces a distribution of object completions, conditioned on incomplete, and possibly noisy, point clouds. The resultant distribution can then be sampled to generate likely completions, each exhibiting plausible shape details while being faithful to the input. To facilitate the use of transformers for 3D, we introduce a compact 3D representation, vector quantized deep implicit function, that utilizes spatial sparsity to represent a close approximation of a 3D shape by a short sequence of discrete variables. Experiments demonstrate that ShapeFormer outperforms prior art for shape completion from ambiguous partial inputs in terms of both completion quality and diversity. We also show that our approach effectively handles a variety of shape types, incomplete patterns, and real-world scans.
    Ordinal-Quadruplet: Retrieval of Missing Classes in Ordinal Time Series. (arXiv:2201.09907v1 [cs.LG])
    (0 min) In this paper, we propose an ordered time series classification framework that is robust against missing classes in the training data, i.e., during testing we can prescribe classes that are missing during training. This framework relies on two main components: (1) our newly proposed ordinal-quadruplet loss, which forces the model to learn latent representation while preserving the ordinal relation among labels, (2) testing procedure, which utilizes the property of latent representation (order preservation). We conduct experiments based on real world multivariate time series data and show the significant improvement in the prediction of missing labels even with 40% of the classes are missing from training. Compared with the well-known triplet loss optimization augmented with interpolation for missing information, in some cases, we nearly double the accuracy.
    Learning Optimal Fair Classification Trees. (arXiv:2201.09932v1 [cs.LG])
    (0 min) The increasing use of machine learning in high-stakes domains -- where people's livelihoods are impacted -- creates an urgent need for interpretable and fair algorithms. In these settings it is also critical for such algorithms to be accurate. With these needs in mind, we propose a mixed integer optimization (MIO) framework for learning optimal classification trees of fixed depth that can be conveniently augmented with arbitrary domain specific fairness constraints. We benchmark our method against the state-of-the-art approach for building fair trees on popular datasets; given a fixed discrimination threshold, our approach improves out-of-sample (OOS) accuracy by 2.3 percentage points on average and obtains a higher OOS accuracy on 88.9% of the experiments. We also incorporate various algorithmic fairness notions into our method, showcasing its versatile modeling power that allows decision makers to fine-tune the trade-off between accuracy and fairness.
    Anomaly detection using principles of human perception. (arXiv:2103.12323v3 [cs.CR] UPDATED)
    (2 min) In the fields of statistics and unsupervised machine learning a fundamental and well-studied problem is anomaly detection. Anomalies are difficult to define, yet many algorithms have been proposed. Underlying the approaches is the nebulous understanding that anomalies are rare, unusual or inconsistent with the majority of data. The present work provides a philosophical treatise to clearly define anomalies and develops an algorithm for their efficient detection with minimal user intervention. Inspired by the Gestalt School of Psychology and the Helmholtz principle of human perception, anomalies are assumed to be observations that are unexpected to occur with respect to certain groupings made by the majority of the data. Under appropriate random variable modelling anomalies are directly found in a set of data by a uniform and independent random assumption of the distribution of constituent elements of the observations, with anomalies corresponding to those observations where the expectation of the number of occurrences of the elements in a given view is $<1$. Starting from fundamental principles of human perception an unsupervised anomaly detection algorithm is developed that is simple, real-time and parameter-free. Experiments suggest it as a competing choice for univariate data with promising results on the detection of global anomalies in multivariate data.
    An Energy Consumption Model for Electrical Vehicle Networks via Extended Federated-learning. (arXiv:2111.08472v2 [eess.SY] UPDATED)
    (0 min) Electrical vehicle (EV) raises to promote an eco-sustainable society. Nevertheless, the "range anxiety" of EV hinders its wider acceptance among customers. This paper proposes a novel solution to range anxiety based on a federated-learning model, which is capable of estimating battery consumption and providing energy-efficient route planning for vehicle networks. Specifically, the new approach extends the federated-learning structure with two components: anomaly detection and sharing policy. The first component identifies preventing factors in model learning, while the second component offers guidelines for information sharing amongst vehicle networks when the sharing is necessary to preserve learning efficiency. The two components collaborate to enhance learning robustness against data heterogeneities in networks. Numerical experiments are conducted, and the results show that compared with considered solutions, the proposed approach could provide higher accuracy of battery-consumption estimation for vehicles under heterogeneous data distributions, without increasing the time complexity or transmitting raw data among vehicle networks.
    Leveraging Structural Properties of Source Code Graphs for Just-In-Time Bug Prediction. (arXiv:2201.10137v1 [cs.SE])
    (0 min) The most common use of data visualization is to minimize the complexity for proper understanding. A graph is one of the most commonly used representations for understanding relational data. It produces a simplified representation of data that is challenging to comprehend if kept in a textual format. In this study, we propose a methodology to utilize the relational properties of source code in the form of a graph to identify Just-in-Time (JIT) bug prediction in software systems during different revisions of software evolution and maintenance. We presented a method to convert the source codes of commit patches to equivalent graph representations and named it Source Code Graph (SCG). To understand and compare multiple source code graphs, we extracted several structural properties of these graphs, such as the density, number of cycles, nodes, edges, etc. We then utilized the attribute values of those SCGs to visualize and detect buggy software commits. We process more than 246K software commits from 12 subject systems in this investigation. Our investigation on these 12 open-source software projects written in C++ and Java programming languages shows that if we combine the features from SCG with conventional features used in similar studies, we will get the increased performance of Machine Learning (ML) based buggy commit detection models. We also find the increase of F1~Scores in predicting buggy and non-buggy commits statistically significant using the Wilcoxon Signed Rank Test. Since SCG-based feature values represent the style or structural properties of source code updates or changes in the software system, it suggests the importance of careful maintenance of source code style or structure for keeping a software system bug-free.
    Learning Contextual Bandits Through Perturbed Rewards. (arXiv:2201.09910v1 [cs.LG])
    (0 min) Thanks to the power of representation learning, neural contextual bandit algorithms demonstrate remarkable performance improvement against their classical counterparts. But because their exploration has to be performed in the entire neural network parameter space to obtain nearly optimal regret, the resulting computational cost is prohibitively high. We perturb the rewards when updating the neural network to eliminate the need of explicit exploration and the corresponding computational overhead. We prove that a $\tilde{O}(\tilde{d}\sqrt{T})$ regret upper bound is still achievable under standard regularity conditions, where $T$ is the number of rounds of interactions and $\tilde{d}$ is the effective dimension of a neural tangent kernel matrix. Extensive comparisons with several benchmark contextual bandit algorithms, including two recent neural contextual bandit models, demonstrate the effectiveness and computational efficiency of our proposed neural bandit algorithm.
    SILG: The Multi-environment Symbolic Interactive Language Grounding Benchmark. (arXiv:2110.10661v2 [cs.CL] UPDATED)
    (2 min) Existing work in language grounding typically study single environments. How do we build unified models that apply across multiple environments? We propose the multi-environment Symbolic Interactive Language Grounding benchmark (SILG), which unifies a collection of diverse grounded language learning environments under a common interface. SILG consists of grid-world environments that require generalization to new dynamics, entities, and partially observed worlds (RTFM, Messenger, NetHack), as well as symbolic counterparts of visual worlds that require interpreting rich natural language with respect to complex scenes (ALFWorld, Touchdown). Together, these environments provide diverse grounding challenges in richness of observation space, action space, language specification, and plan complexity. In addition, we propose the first shared model architecture for RL on these environments, and evaluate recent advances such as egocentric local convolution, recurrent state-tracking, entity-centric attention, and pretrained LM using SILG. Our shared architecture achieves comparable performance to environment-specific architectures. Moreover, we find that many recent modelling advances do not result in significant gains on environments other than the one they were designed for. This highlights the need for a multi-environment benchmark. Finally, the best models significantly underperform humans on SILG, which suggests ample room for future work. We hope SILG enables the community to quickly identify new methodologies for language grounding that generalize to a diverse set of environments and their associated challenges.
    Deep Learning for Android Malware Defenses: a Systematic Literature Review. (arXiv:2103.05292v2 [cs.CR] UPDATED)
    (0 min) Malicious applications (particularly those targeting the Android platform) pose a serious threat to developers and end-users. Numerous research efforts have been devoted to developing effective approaches to defend against Android malware. However, given the explosive growth of Android malware and the continuous advancement of malicious evasion technologies like obfuscation and reflection, Android malware defense approaches based on manual rules or traditional machine learning may not be effective. In recent years, a dominant research field called deep learning (DL), which provides a powerful feature abstraction ability, has demonstrated a compelling and promising performance in a variety of areas, like natural language processing and computer vision. To this end, employing deep learning techniques to thwart Android malware attacks has recently garnered considerable research attention. Yet, no systematic literature review focusing on deep learning approaches for Android Malware defenses exists. In this paper, we conducted a systematic literature review to search and analyze how deep learning approaches have been applied in the context of malware defenses in the Android environment. As a result, a total of 132 studies covering the period 2014-2021 were identified. Our investigation reveals that, while the majority of these sources mainly consider DL-based on Android malware detection, 53 primary studies (40.1 percent) design defense approaches based on other scenarios. This review also discusses research trends, research focuses, challenges, and future research directions in DL-based Android malware defenses.
    Sparse bottleneck neural networks for exploratory non-linear visualization of Patch-seq data. (arXiv:2006.10411v3 [cs.LG] UPDATED)
    (2 min) Patch-seq, a recently developed experimental technique, allows neuroscientists to obtain transcriptomic and electrophysiological information from the same neurons. Efficiently analyzing and visualizing such paired multivariate data in order to extract biologically meaningful interpretations has, however, remained a challenge. Here, we use sparse deep neural networks with and without a two-dimensional bottleneck to predict electrophysiological features from the transcriptomic ones using a group lasso penalty, yielding concise and biologically interpretable two-dimensional visualizations. In two large example data sets, this visualization reveals known neural classes and their marker genes without biological prior knowledge. We also demonstrate that our method is applicable to other kinds of multimodal data, such as paired transcriptomic and proteomic measurements provided by CITE-seq.
    Transformers Can Do Bayesian Inference. (arXiv:2112.10510v2 [cs.LG] UPDATED)
    (2 min) Currently, it is hard to reap the benefits of deep learning for Bayesian methods, which allow the explicit specification of prior knowledge and accurately capture model uncertainty. We present Prior-Data Fitted Networks (PFNs). PFNs leverage large-scale machine learning techniques to approximate a large set of posteriors. The only requirement for PFNs to work is the ability to sample from a prior distribution over supervised learning tasks (or functions). Our method restates the objective of posterior approximation as a supervised classification problem with a set-valued input: it repeatedly draws a task (or function) from the prior, draws a set of data points and their labels from it, masks one of the labels and learns to make probabilistic predictions for it based on the set-valued input of the rest of the data points. Presented with a set of samples from a new supervised learning task as input, PFNs make probabilistic predictions for arbitrary other data points in a single forward propagation, having learned to approximate Bayesian inference. We demonstrate that PFNs can near-perfectly mimic Gaussian processes and also enable efficient Bayesian inference for intractable problems, with over 200-fold speedups in multiple setups compared to current methods. We obtain strong results in very diverse areas such as Gaussian process regression, Bayesian neural networks, classification for small tabular data sets, and few-shot image classification, demonstrating the generality of PFNs. Code and trained PFNs are released at https://github.com/automl/TransformersCanDoBayesianInference.
    Sharing to learn and learning to share -- Fitting together Meta-Learning, Multi-Task Learning, and Transfer Learning: A meta review. (arXiv:2111.12146v2 [cs.LG] UPDATED)
    (2 min) Integrating knowledge across different domains is an essential feature of human learning. Learning paradigms like transfer learning, meta learning, and multi-task learning reflect the human learning process by exploiting the prior knowledge for new tasks, encouraging faster learning and good generalization for new tasks. This article gives a detailed view of these learning paradigms along with a comparative analysis. The weakness of a learning algorithm turns out to be the strength of another, and thereby merging them is a prevalent trait in the literature. This work delivers a literature review of the articles, which fuses two algorithms to accomplish multiple tasks. A global generic learning network, an ensemble of meta learning, transfer learning, and multi-task learning, is also introduced here, along with some open research questions and directions for future research.
    Learning Stable Classifiers by Transferring Unstable Features. (arXiv:2106.07847v3 [cs.LG] UPDATED)
    (2 min) While unbiased machine learning models are essential for many applications, bias is a human-defined concept that can vary across tasks. Given only input-label pairs, algorithms may lack sufficient information to distinguish stable (causal) features from unstable (spurious) features. However, related tasks often share similar biases -- an observation we may leverage to develop stable classifiers in the transfer setting. In this work, we explicitly inform the target classifier about unstable features in the source tasks. Specifically, we derive a representation that encodes the unstable features by contrasting different data environments in the source task. We achieve robustness by clustering data of the target task according to this representation and minimizing the worst-case risk across these clusters. We evaluate our method on both text and image classifications. Empirical results demonstrate that our algorithm is able to maintain robustness on the target task for both synthetically generated environments and real-world environments.
    Deep Reinforcement Model Selection for Communications Resource Allocation in On-Site Medical Care. (arXiv:2111.06680v2 [cs.LG] UPDATED)
    (2 min) Greater capabilities of mobile communications technology enable interconnection of on-site medical care at a scale previously unavailable. However, embedding such critical, demanding tasks into the already complex infrastructure of mobile communications proves challenging. This paper explores a resource allocation scenario where a scheduler must balance mixed performance metrics among connected users. To fulfill this resource allocation task, we present a scheduler that adaptively switches between different model-based scheduling algorithms. We make use of a deep Q-Network to learn the benefit of selecting a scheduling paradigm for a given situation, combining advantages from model-driven and data-driven approaches. The resulting ensemble scheduler is able to combine its constituent algorithms to maximize a sum-utility cost function while ensuring performance on designated high-priority users.
    A Classical Approach to Handcrafted Feature Extraction Techniques for Bangla Handwritten Digit Recognition. (arXiv:2201.10102v1 [cs.CV])
    (0 min) Bangla Handwritten Digit recognition is a significant step forward in the development of Bangla OCR. However, intricate shape, structural likeness and distinctive composition style of Bangla digits makes it relatively challenging to distinguish. Thus, in this paper, we benchmarked four rigorous classifiers to recognize Bangla Handwritten Digit: K-Nearest Neighbor (KNN), Support Vector Machine (SVM), Random Forest (RF), and Gradient-Boosted Decision Trees (GBDT) based on three handcrafted feature extraction techniques: Histogram of Oriented Gradients (HOG), Local Binary Pattern (LBP), and Gabor filter on four publicly available Bangla handwriting digits datasets: NumtaDB, CMARTdb, Ekush and BDRW. Here, handcrafted feature extraction methods are used to extract features from the dataset image, which are then utilized to train machine learning classifiers to identify Bangla handwritten digits. We further fine-tuned the hyperparameters of the classification algorithms in order to acquire the finest Bangla handwritten digits recognition performance from these algorithms, and among all the models we employed, the HOG features combined with SVM model (HOG+SVM) attained the best performance metrics across all datasets. The recognition accuracy of the HOG+SVM method on the NumtaDB, CMARTdb, Ekush and BDRW datasets reached 93.32%, 98.08%, 95.68% and 89.68%, respectively as well as we compared the model performance with recent state-of-art methods.
    GENESIS-V2: Inferring Unordered Object Representations without Iterative Refinement. (arXiv:2104.09958v3 [cs.CV] UPDATED)
    (0 min) Advances in unsupervised learning of object-representations have culminated in the development of a broad range of methods for unsupervised object segmentation and interpretable object-centric scene generation. These methods, however, are limited to simulated and real-world datasets with limited visual complexity. Moreover, object representations are often inferred using RNNs which do not scale well to large images or iterative refinement which avoids imposing an unnatural ordering on objects in an image but requires the a priori initialisation of a fixed number of object representations. In contrast to established paradigms, this work proposes an embedding-based approach in which embeddings of pixels are clustered in a differentiable fashion using a stochastic stick-breaking process. Similar to iterative refinement, this clustering procedure also leads to randomly ordered object representations, but without the need of initialising a fixed number of clusters a priori. This is used to develop a new model, GENESIS-v2, which can infer a variable number of object representations without using RNNs or iterative refinement. We show that GENESIS-v2 performs strongly in comparison to recent baselines in terms of unsupervised image segmentation and object-centric scene generation on established synthetic datasets as well as more complex real-world datasets.
    UCTransNet: Rethinking the Skip Connections in U-Net from a Channel-wise Perspective with Transformer. (arXiv:2109.04335v3 [cs.CV] UPDATED)
    (0 min) Most recent semantic segmentation methods adopt a U-Net framework with an encoder-decoder architecture. It is still challenging for U-Net with a simple skip connection scheme to model the global multi-scale context: 1) Not each skip connection setting is effective due to the issue of incompatible feature sets of encoder and decoder stage, even some skip connection negatively influence the segmentation performance; 2) The original U-Net is worse than the one without any skip connection on some datasets. Based on our findings, we propose a new segmentation framework, named UCTransNet (with a proposed CTrans module in U-Net), from the channel perspective with attention mechanism. Specifically, the CTrans module is an alternate of the U-Net skip connections, which consists of a sub-module to conduct the multi-scale Channel Cross fusion with Transformer (named CCT) and a sub-module Channel-wise Cross-Attention (named CCA) to guide the fused multi-scale channel-wise information to effectively connect to the decoder features for eliminating the ambiguity. Hence, the proposed connection consisting of the CCT and CCA is able to replace the original skip connection to solve the semantic gaps for an accurate automatic medical image segmentation. The experimental results suggest that our UCTransNet produces more precise segmentation performance and achieves consistent improvements over the state-of-the-art for semantic segmentation across different datasets and conventional architectures involving transformer or U-shaped framework. Code: https://github.com/McGregorWwww/UCTransNet.
    Active Sensing for Communications by Learning. (arXiv:2112.04075v2 [cs.IT] UPDATED)
    (0 min) This paper proposes a deep learning approach to a class of active sensing problems in wireless communications in which an agent sequentially interacts with an environment over a predetermined number of time frames to gather information in order to perform a sensing or actuation task for maximizing some utility function. In such an active learning setting, the agent needs to design an adaptive sensing strategy sequentially based on the observations made so far. To tackle such a challenging problem in which the dimension of historical observations increases over time, we propose to use a long short-term memory (LSTM) network to exploit the temporal correlations in the sequence of observations and to map each observation to a fixed-size state information vector. We then use a deep neural network (DNN) to map the LSTM state at each time frame to the design of the next measurement step. Finally, we employ another DNN to map the final LSTM state to the desired solution. We investigate the performance of the proposed framework for adaptive channel sensing problems in wireless communications. In particular, we consider the adaptive beamforming problem for mmWave beam alignment and the adaptive reconfigurable intelligent surface sensing problem for reflection alignment. Numerical results demonstrate that the proposed deep active sensing strategy outperforms the existing adaptive or nonadaptive sensing schemes.
    Benchmarking Uncertainty Quantification on Biosignal Classification Tasks under Dataset Shift. (arXiv:2112.09196v2 [cs.LG] UPDATED)
    (0 min) A biosignal is a signal that can be continuously measured from human bodies, such as respiratory sounds, heart activity (ECG), brain waves (EEG), etc, based on which, machine learning models have been developed with very promising performance for automatic disease detection and health status monitoring. However, dataset shift, i.e., data distribution of inference varies from the distribution of the training, is not uncommon for real biosignal-based applications. To improve the robustness, probabilistic models with uncertainty quantification are adapted to capture how reliable a prediction is. Yet, assessing the quality of the estimated uncertainty remains a challenge. In this work, we propose a framework to evaluate the capability of the estimated uncertainty in capturing different types of biosignal dataset shifts with various degrees. In particular, we use three classification tasks based on respiratory sounds and electrocardiography signals to benchmark five representative uncertainty quantification methods. Extensive experiments show that, although Ensemble and Bayesian models could provide relatively better uncertainty estimations under dataset shifts, all tested models fail to meet the promise in trustworthy prediction and model calibration. Our work paves the way for a comprehensive evaluation for any newly developed biosignal classifiers.
    Augmented RBMLE-UCB Approach for Adaptive Control of Linear Quadratic Systems. (arXiv:2201.10542v1 [math.OC])
    (2 min) We consider the problem of controlling a stochastic linear system with quadratic costs, when its system parameters are not known to the agent -- called the adaptive LQG control problem. We re-examine an approach called "Reward-Biased Maximum Likelihood Estimate" (RBMLE) that was proposed more than forty years ago, and which predates the "Upper Confidence Bound" (UCB) method as well as the definition of "regret". It simply added a term favoring parameters with larger rewards to the estimation criterion. We propose an augmented approach that combines the penalty of the RBMLE method with the constraint of the UCB method, uniting the two approaches to optimization in the face of uncertainty. We first establish that theoretically this method retains $\mathcal{O}(\sqrt{T})$ regret, the best known so far. We show through a comprehensive simulation study that this augmented RBMLE method considerably outperforms the UCB and Thompson sampling approaches, with a regret that is typically less than 50\% of the better of their regrets. The simulation study includes all examples from earlier papers as well as a large collection of randomly generated systems.
    Contrastive Representation Distillation. (arXiv:1910.10699v3 [cs.LG] UPDATED)
    (0 min) Often we wish to transfer representational knowledge from one neural network to another. Examples include distilling a large network into a smaller one, transferring knowledge from one sensory modality to a second, or ensembling a collection of models into a single estimator. Knowledge distillation, the standard approach to these problems, minimizes the KL divergence between the probabilistic outputs of a teacher and student network. We demonstrate that this objective ignores important structural knowledge of the teacher network. This motivates an alternative objective by which we train a student to capture significantly more information in the teacher's representation of the data. We formulate this objective as contrastive learning. Experiments demonstrate that our resulting new objective outperforms knowledge distillation and other cutting-edge distillers on a variety of knowledge transfer tasks, including single model compression, ensemble distillation, and cross-modal transfer. Our method sets a new state-of-the-art in many transfer tasks, and sometimes even outperforms the teacher network when combined with knowledge distillation. Code: this http URL
    MQBench: Towards Reproducible and Deployable Model Quantization Benchmark. (arXiv:2111.03759v2 [cs.LG] UPDATED)
    (0 min) Model quantization has emerged as an indispensable technique to accelerate deep learning inference. While researchers continue to push the frontier of quantization algorithms, existing quantization work is often unreproducible and undeployable. This is because researchers do not choose consistent training pipelines and ignore the requirements for hardware deployments. In this work, we propose Model Quantization Benchmark (MQBench), a first attempt to evaluate, analyze, and benchmark the reproducibility and deployability for model quantization algorithms. We choose multiple different platforms for real-world deployments, including CPU, GPU, ASIC, DSP, and evaluate extensive state-of-the-art quantization algorithms under a unified training pipeline. MQBench acts like a bridge to connect the algorithm and the hardware. We conduct a comprehensive analysis and find considerable intuitive or counter-intuitive insights. By aligning the training settings, we find existing algorithms have about the same performance on the conventional academic track. While for the hardware-deployable quantization, there is a huge accuracy gap which remains unsettled. Surprisingly, no existing algorithm wins every challenge in MQBench, and we hope this work could inspire future research directions.
    Fast Distributionally Robust Learning with Variance Reduced Min-Max Optimization. (arXiv:2104.13326v2 [cs.LG] UPDATED)
    (0 min) Distributionally robust supervised learning (DRSL) is emerging as a key paradigm for building reliable machine learning systems for real-world applications -- reflecting the need for classifiers and predictive models that are robust to the distribution shifts that arise from phenomena such as selection bias or nonstationarity. Existing algorithms for solving Wasserstein DRSL -- one of the most popular DRSL frameworks based around robustness to perturbations in the Wasserstein distance -- have serious limitations that limit their use in large-scale problems -- in particular they involve solving complex subproblems and they fail to make use of stochastic gradients. We revisit Wasserstein DRSL through the lens of min-max optimization and derive scalable and efficiently implementable stochastic extra-gradient algorithms which provably achieve faster convergence rates than existing approaches. We demonstrate their effectiveness on synthetic and real data when compared to existing DRSL approaches. Key to our results is the use of variance reduction and random reshuffling to accelerate stochastic min-max optimization, the analysis of which may be of independent interest.
    Regret Bounds for Gaussian-Process Optimization in Large Domains. (arXiv:2104.14113v3 [cs.LG] UPDATED)
    (0 min) The goal of this paper is to characterize Gaussian-Process optimization in the setting where the function domain is large relative to the number of admissible function evaluations, i.e., where it is impossible to find the global optimum. We provide upper bounds on the suboptimality (Bayesian simple regret) of the solution found by optimization strategies that are closely related to the widely used expected improvement (EI) and upper confidence bound (UCB) algorithms. These regret bounds illuminate the relationship between the number of evaluations, the domain size (i.e. cardinality of finite domains / Lipschitz constant of the covariance function in continuous domains), and the optimality of the retrieved function value. In particular, we show that even when the number of evaluations is far too small to find the global optimum, we can find nontrivial function values (e.g. values that achieve a certain ratio with the optimal value).
    Constrained Policy Optimization via Bayesian World Models. (arXiv:2201.09802v2 [cs.LG] UPDATED)
    (0 min) Improving sample-efficiency and safety are crucial challenges when deploying reinforcement learning in high-stakes real world applications. We propose LAMBDA, a novel model-based approach for policy optimization in safety critical tasks modeled via constrained Markov decision processes. Our approach utilizes Bayesian world models, and harnesses the resulting uncertainty to maximize optimistic upper bounds on the task objective, as well as pessimistic upper bounds on the safety constraints. We demonstrate LAMBDA's state of the art performance on the Safety-Gym benchmark suite in terms of sample efficiency and constraint violation.
    Adaptive Activation-based Structured Pruning. (arXiv:2201.10520v1 [cs.CV])
    (0 min) Pruning is a promising approach to compress complex deep learning models in order to deploy them on resource-constrained edge devices. However, many existing pruning solutions are based on unstructured pruning, which yield models that cannot efficiently run on commodity hardware, and require users to manually explore and tune the pruning process, which is time consuming and often leads to sub-optimal results. To address these limitations, this paper presents an adaptive, activation-based, structured pruning approach to automatically and efficiently generate small, accurate, and hardware-efficient models that meet user requirements. First, it proposes iterative structured pruning using activation-based attention feature maps to effectively identify and prune unimportant filters. Then, it proposes adaptive pruning policies for automatically meeting the pruning objectives of accuracy-critical, memory-constrained, and latency-sensitive tasks. A comprehensive evaluation shows that the proposed method can substantially outperform the state-of-the-art structured pruning works on CIFAR-10 and ImageNet datasets. For example, on ResNet-56 with CIFAR-10, without any accuracy drop, our method achieves the largest parameter reduction (79.11%), outperforming the related works by 22.81% to 66.07%, and the largest FLOPs reduction (70.13%), outperforming the related works by 14.13% to 26.53%.
    Initial Investigations Towards Non-invasive Monitoring of Chronic Wound Healing Using Deep Learning and Ultrasound Imaging. (arXiv:2201.10511v1 [eess.IV])
    (0 min) Chronic wounds including diabetic and arterial/venous insufficiency injuries have become a major burden for healthcare systems worldwide. Demographic changes suggest that wound care will play an even bigger role in the coming decades. Predicting and monitoring response to therapy in wound care is currently largely based on visual inspection with little information on the underlying tissue. Thus, there is an urgent unmet need for innovative approaches that facilitate personalized diagnostics and treatments at the point-of-care. It has been recently shown that ultrasound imaging can monitor response to therapy in wound care, but this work required onerous manual image annotations. In this study, we present initial results of a deep learning-based automatic segmentation of cross-sectional wound size in ultrasound images and identify requirements and challenges for future research on this application. Evaluation of the segmentation results underscores the potential of the proposed deep learning approach to complement non-invasive imaging with Dice scores of 0.34 (U-Net, FCN) and 0.27 (ResNet-U-Net) but also highlights the need for improving robustness further. We conclude that deep learning-supported analysis of non-invasive ultrasound images is a promising area of research to automatically extract cross-sectional wound size and depth information with potential value in monitoring response to therapy.
    Beyond the Frontier: Fairness Without Privacy Loss. (arXiv:2201.10408v1 [cs.LG])
    (0 min) Notions of fair machine learning that seek to control various kinds of error across protected groups generally are cast as constrained optimization problems over a fixed model class. For such problems, tradeoffs arise: asking for various kinds of technical fairness requires compromising on overall error, and adding more protected groups increases error rates across all groups. Our goal is to break though such accuracy-fairness tradeoffs. We develop a simple algorithmic framework that allows us to deploy models and then revise them dynamically when groups are discovered on which the error rate is suboptimal. Protected groups don't need to be pre-specified: At any point, if it is discovered that there is some group on which our current model performs substantially worse than optimally, then there is a simple update operation that improves the error on that group without increasing either overall error or the error on previously identified groups. We do not restrict the complexity of the groups that can be identified, and they can intersect in arbitrary ways. The key insight that allows us to break through the tradeoff barrier is to dynamically expand the model class as new groups are identified. The result is provably fast convergence to a model that can't be distinguished from the Bayes optimal predictor, at least by those tasked with finding high error groups. We explore two instantiations of this framework: as a "bias bug bounty" design in which external auditors are invited to discover groups on which our current model's error is suboptimal, and as an algorithmic paradigm in which the discovery of groups on which the error is suboptimal is posed as an optimization problem. In the bias bounty case, when we say that a model cannot be distinguished from Bayes optimal, we mean by any participant in the bounty program. We provide both theoretical analysis and experimental validation.
    Rayleigh EigenDirections (REDs): GAN latent space traversals for multidimensional features. (arXiv:2201.10423v1 [cs.CV])
    (0 min) We present a method for finding paths in a deep generative model's latent space that can maximally vary one set of image features while holding others constant. Crucially, unlike past traversal approaches, ours can manipulate multidimensional features of an image such as facial identity and pixels within a specified region. Our method is principled and conceptually simple: optimal traversal directions are chosen by maximizing differential changes to one feature set such that changes to another set are negligible. We show that this problem is nearly equivalent to one of Rayleigh quotient maximization, and provide a closed-form solution to it based on solving a generalized eigenvalue equation. We use repeated computations of the corresponding optimal directions, which we call Rayleigh EigenDirections (REDs), to generate appropriately curved paths in latent space. We empirically evaluate our method using StyleGAN2 on two image domains: faces and living rooms. We show that our method is capable of controlling various multidimensional features out of the scope of previous latent space traversal methods: face identity, spatial frequency bands, pixels within a region, and the appearance and position of an object. Our work suggests that a wealth of opportunities lies in the local analysis of the geometry and semantics of latent spaces.
    On-Device Learning with Cloud-Coordinated Data Augmentation for Extreme Model Personalization in Recommender Systems. (arXiv:2201.10382v1 [cs.LG])
    (0 min) Data heterogeneity is an intrinsic property of recommender systems, making models trained over the global data on the cloud, which is the mainstream in industry, non-optimal to each individual user's local data distribution. To deal with data heterogeneity, model personalization with on-device learning is a potential solution. However, on-device training using a user's small size of local samples will incur severe overfitting and undermine the model's generalization ability. In this work, we propose a new device-cloud collaborative learning framework, called CoDA, to break the dilemmas of purely cloud-based learning and on-device learning. The key principle of CoDA is to retrieve similar samples from the cloud's global pool to augment each user's local dataset to train the recommendation model. Specifically, after a coarse-grained sample matching on the cloud, a personalized sample classifier is further trained on each device for a fine-grained sample filtering, which can learn the boundary between the local data distribution and the outside data distribution. We also build an end-to-end pipeline to support the flows of data, model, computation, and control between the cloud and each device. We have deployed CoDA in a recommendation scenario of Mobile Taobao. Online A/B testing results show the remarkable performance improvement of CoDA over both cloud-based learning without model personalization and on-device training without data augmentation. Overhead testing on a real device demonstrates the computation, storage, and communication efficiency of the on-device tasks in CoDA.
    A Method to Predict Semantic Relations on Artificial Intelligence Papers. (arXiv:2201.10518v1 [cs.SI])
    (0 min) Predicting the emergence of links in large evolving networks is a difficult task with many practical applications. Recently, the Science4cast competition has illustrated this challenge presenting a network of 64.000 AI concepts and asking the participants to predict which topics are going to be researched together in the future. In this paper, we present a solution to this problem based on a new family of deep learning approaches, namely Graph Neural Networks. The results of the challenge show that our solution is competitive even if we had to impose severe restrictions to obtain a computationally efficient and parsimonious model: ignoring the intrinsic dynamics of the graph and using only a small subset of the nodes surrounding a target link. Preliminary experiments presented in this paper suggest the model is learning two related, but different patterns: the absorption of a node by a sub-graph and union of more dense sub-graphs. The model seems to excel at recognizing the first type of pattern.
    Interpretability in Convolutional Neural Networks for Building Damage Classification in Satellite Imagery. (arXiv:2201.10523v1 [cs.CV])
    (0 min) Natural disasters ravage the world's cities, valleys, and shores on a regular basis. Deploying precise and efficient computational mechanisms for assessing infrastructure damage is essential to channel resources and minimize the loss of life. Using a dataset that includes labeled pre- and post- disaster satellite imagery, we take a machine learning-based remote sensing approach and train multiple convolutional neural networks (CNNs) to assess building damage on a per-building basis. We present a novel methodology of interpretable deep learning that seeks to explicitly investigate the most useful modalities of information in the training data to create an accurate classification model. We also investigate which loss functions best optimize these models. Our findings include that ordinal-cross entropy loss is the most optimal criterion for optimization to use and that including the type of disaster that caused the damage in combination with pre- and post-disaster training data most accurately predicts the level of damage caused. Further, we make progress in the qualitative representation of which parts of the images that the model is using to predict damage levels, through gradient-weighted class activation mapping (Grad-CAM). Our research seeks to computationally contribute to aiding in this ongoing and growing humanitarian crisis, heightened by anthropogenic climate change.
    Roadmap for Cybersecurity in Autonomous Vehicles. (arXiv:2201.10349v1 [cs.CR])
    (0 min) Autonomous vehicles are on the horizon and will be transforming transportation safety and comfort. These vehicles will be connected to various external systems and utilize advanced embedded systems to perceive their environment and make intelligent decisions. However, this increased connectivity makes these vehicles vulnerable to various cyber-attacks that can have catastrophic effects. Attacks on automotive systems are already on the rise in today's vehicles and are expected to become more commonplace in future autonomous vehicles. Thus, there is a need to strengthen cybersecurity in future autonomous vehicles. In this article, we discuss major automotive cyber-attacks over the past decade and present state-of-the-art solutions that leverage artificial intelligence (AI). We propose a roadmap towards building secure autonomous vehicles and highlight key open challenges that need to be addressed.
    Turn-to-Diarize: Online Speaker Diarization Constrained by Transformer Transducer Speaker Turn Detection. (arXiv:2109.11641v3 [eess.AS] UPDATED)
    (0 min) In this paper, we present a novel speaker diarization system for streaming on-device applications. In this system, we use a transformer transducer to detect the speaker turns, represent each speaker turn by a speaker embedding, then cluster these embeddings with constraints from the detected speaker turns. Compared with conventional clustering-based diarization systems, our system largely reduces the computational cost of clustering due to the sparsity of speaker turns. Unlike other supervised speaker diarization systems which require annotations of time-stamped speaker labels for training, our system only requires including speaker turn tokens during the transcribing process, which largely reduces the human efforts involved in data collection.
    What's Wrong with Deep Learning in Tree Search for Combinatorial Optimization. (arXiv:2201.10494v1 [cs.LG])
    (0 min) Combinatorial optimization lies at the core of many real-world problems. Especially since the rise of graph neural networks (GNNs), the deep learning community has been developing solvers that derive solutions to NP-hard problems by learning the problem-specific solution structure. However, reproducing the results of these publications proves to be difficult. We make three contributions. First, we present an open-source benchmark suite for the NP-hard Maximum Independent Set problem, in both its weighted and unweighted variants. The suite offers a unified interface to various state-of-the-art traditional and machine learning-based solvers. Second, using our benchmark suite, we conduct an in-depth analysis of the popular guided tree search algorithm by Li et al. [NeurIPS 2018], testing various configurations on small and large synthetic and real-world graphs. By re-implementing their algorithm with a focus on code quality and extensibility, we show that the graph convolution network used in the tree search does not learn a meaningful representation of the solution structure, and can in fact be replaced by random values. Instead, the tree search relies on algorithmic techniques like graph kernelization to find good solutions. Thus, the results from the original publication are not reproducible. Third, we extend the analysis to compare the tree search implementations to other solvers, showing that the classical algorithmic solvers often are faster, while providing solutions of similar quality. Additionally, we analyze a recent solver based on reinforcement learning and observe that for this solver, the GNN is responsible for the competitive solution quality.
    Interspecies Collaboration in the Design of Visual Identity: A Case Study. (arXiv:2201.10393v1 [cs.LG])
    (0 min) Design usually relies on human ingenuity, but the past decade has seen the field's toolbox expanding to Artificial Intelligence (AI) and its adjacent methods, making room for hybrid, algorithmic creations. This article aims to substantiate the concept of interspecies collaboration - that of natural and artificial intelligence - in the active co-creation of a visual identity, describing a case study of the Regional Center of Excellence for Robotic Technology (CRTA) which opened on 750 m2 in June 2021 within the University of Zagreb. The visual identity of the Center comprises three separately devised elements, each representative of the human-AI relationship and embedded in the institution's logo. Firstly, the letter "C" (from the CRTA acronym) was created using a Gaussian Mixture Model (GMM) applied to (x, y) coordinates that the neurosurgical robot RONNA, CRTA's flagship innovation, generated when hand-guided by a human operator. The second shape of the letter "C" was created by using the same (x, y) coordinates as inputs fed to a neural network whose goal was to output letters in a novel, AI-generated typography. A basic feedforward back-propagating neural network with two hidden layers was chosen for the task. The final and third design element was a trajectory the robot RONNA makes when performing a brain biopsy. As CRTA embodies a state-of-the-art venue for robotics research, the 'interspecies' approach was used to accentuate the importance of human-robot collaboration which is at the core of the newly opened Center, illustrating the potential of reciprocal and amicable relationship that humans could have with technology.
    Nonlinear MPC for Offset-Free Tracking of systems learned by GRU Neural Networks. (arXiv:2103.02383v4 [eess.SY] UPDATED)
    (0 min) The use of Recurrent Neural Networks (RNNs) for system identification has recently gathered increasing attention, thanks to their black-box modeling capabilities.Albeit RNNs have been fruitfully adopted in many applications, only few works are devoted to provide rigorous theoretical foundations that justify their use for control purposes. The aim of this paper is to describe how stable Gated Recurrent Units (GRUs), a particular RNN architecture, can be trained and employed in a Nonlinear MPC framework to perform offset-free tracking of constant references with guaranteed closed-loop stability. The proposed approach is tested on a pH neutralization process benchmark, showing remarkable performances.
    Distributed Image Transmission using Deep Joint Source-Channel Coding. (arXiv:2201.10340v1 [cs.IT])
    (2 min) We study the problem of deep joint source-channel coding (D-JSCC) for correlated image sources, where each source is transmitted through a noisy independent channel to the common receiver. In particular, we consider a pair of images captured by two cameras with probably overlapping fields of view transmitted over wireless channels and reconstructed in the center node. The challenging problem involves designing a practical code to utilize both source and channel correlations to improve transmission efficiency without additional transmission overhead. To tackle this, we need to consider the common information across two stereo images as well as the differences between two transmission channels. In this case, we propose a deep neural networks solution that includes lightweight edge encoders and a powerful center decoder. Besides, in the decoder, we propose a novel channel state information aware cross attention module to highlight the overlapping fields and leverage the relevance between two noisy feature maps.Our results show the impressive improvement of reconstruction quality in both links by exploiting the noisy representations of the other link. Moreover, the proposed scheme shows competitive results compared to the separated schemes with capacity-achieving channel codes.
    GMM Discriminant Analysis with Noisy Label for Each Class. (arXiv:2201.10242v1 [cs.LG])
    (2 min) Real world datasets often contain noisy labels, and learning from such datasets using standard classification approaches may not produce the desired performance. In this paper, we propose a Gaussian Mixture Discriminant Analysis (GMDA) with noisy label for each class. We introduce flipping probability and class probability and use EM algorithms to solve the discriminant problem with label noise. We also provide the detail proofs of convergence. Experimental results on synthetic and real-world datasets show that the proposed approach notably outperforms other four state-of-art methods.
    Transformer-Based Video Front-Ends for Audio-Visual Speech Recognition. (arXiv:2201.10439v1 [cs.CV])
    (2 min) Audio-visual automatic speech recognition (AV-ASR) extends the speech recognition by introducing the video modality. In particular, the information contained in the motion of the speaker's mouth is used to augment the audio features. The video modality is traditionally processed with a 3D convolutional neural network (e.g. 3D version of VGG). Recently, image transformer networks arXiv:2010.11929 demonstrated the ability to extract rich visual features for the image classification task. In this work, we propose to replace the 3D convolution with a video transformer video feature extractor. We train our baselines and the proposed model on a large scale corpus of the YouTube videos. Then we evaluate the performance on a labeled subset of YouTube as well as on the public corpus LRS3-TED. Our best model video-only model achieves the performance of 34.9% WER on YTDEV18 and 19.3% on LRS3-TED which is a 10% and 9% relative improvements over the convolutional baseline. We achieve the state of the art performance of the audio-visual recognition on the LRS3-TED after fine-tuning our model (1.6% WER).
    Differentially Private Temporal Difference Learning with Stochastic Nonconvex-Strongly-Concave Optimization. (arXiv:2201.10447v1 [cs.LG])
    (2 min) Temporal difference (TD) learning is a widely used method to evaluate policies in reinforcement learning. While many TD learning methods have been developed in recent years, little attention has been paid to preserving privacy and most of the existing approaches might face the concerns of data privacy from users. To enable complex representative abilities of policies, in this paper, we consider preserving privacy in TD learning with nonlinear value function approximation. This is challenging because such a nonlinear problem is usually studied in the formulation of stochastic nonconvex-strongly-concave optimization to gain finite-sample analysis, which would require simultaneously preserving the privacy on primal and dual sides. To this end, we employ a momentum-based stochastic gradient descent ascent to achieve a single-timescale algorithm, and achieve a good trade-off between meaningful privacy and utility guarantees of both the primal and dual sides by perturbing the gradients on both sides using well-calibrated Gaussian noises. As a result, our DPTD algorithm could provide $(\epsilon,\delta)$-differential privacy (DP) guarantee for the sensitive information encoded in transitions and retain the original power of TD learning, with the utility upper bounded by $\widetilde{\mathcal{O}}(\frac{(d\log(1/\delta))^{1/8}}{(n\epsilon)^{1/4}})$ (The tilde in this paper hides the log factor.), where $n$ is the trajectory length and $d$ is the dimension. Extensive experiments conducted in OpenAI Gym show the advantages of our proposed algorithm.
    Finding Influential Instances for Distantly Supervised Relation Extraction. (arXiv:2009.09841v2 [cs.LG] UPDATED)
    (2 min) Distant supervision (DS) is a strong way to expand the datasets for enhancing relation extraction (RE) models but often suffers from high label noise. Current works based on attention, reinforcement learning, or GAN are black-box models so they neither provide meaningful interpretation of sample selection in DS nor stability on different domains. On the contrary, this work proposes a novel model-agnostic instance sampling method for DS by influence function (IF), namely REIF. Our method identifies favorable/unfavorable instances in the bag based on IF, then does dynamic instance sampling. We design a fast influence sampling algorithm that reduces the computational complexity from $\mathcal{O}(mn)$ to $\mathcal{O}(1)$, with analyzing its robustness on the selected sampling function. Experiments show that by simply sampling the favorable instances during training, REIF is able to win over a series of baselines that have complicated architectures. We also demonstrate that REIF can support interpretable instance selection.
    COVID-19 Status Forecasting Using New Viral variants and Vaccination Effectiveness Models. (arXiv:2201.10356v1 [cs.LG])
    (2 min) Background: Recently, a high number of daily positive COVID-19 cases have been reported in regions with relatively high vaccination rates; hence, booster vaccination has become necessary. In addition, infections caused by the different variants and correlated factors have not been discussed in depth. With large variabilities and different co-factors, it is difficult to use conventional mathematical models to forecast the incidence of COVID-19. Methods: Machine learning based on long short-term memory was applied to forecasting the time series of new daily positive cases (DPC), serious cases, hospitalized cases, and deaths. Data acquired from regions with high rates of vaccination, such as Israel, were blended with the current data of other regions in Japan to factor in the potential effects of vaccination. The protection provided by symptomatic infection was also considered in terms of the population effectiveness of vaccination as well as the waning protection and ratio and infectivity of viral variants. To represent changes in public behavior, public mobility and interactions through social media were also included in the analysis. Findings: Comparing the observed and estimated new DPC in Tel Aviv, Israel, the parameters characterizing vaccination effectiveness and the waning protection from infection were well estimated; the vaccination effectiveness of the second dose after 5 months and the third dose after two weeks from infection by the delta variant were 0.24 and 0.95, respectively. Using the extracted parameters regarding vaccination effectiveness, new cases in three prefectures of Japan were replicated.
    Sphere2Vec: Multi-Scale Representation Learning over a Spherical Surface for Geospatial Predictions. (arXiv:2201.10489v1 [cs.CV])
    (2 min) Generating learning-friendly representations for points in a 2D space is a fundamental and long-standing problem in machine learning. Recently, multi-scale encoding schemes (such as Space2Vec) were proposed to directly encode any point in 2D space as a high-dimensional vector, and has been successfully applied to various (geo)spatial prediction tasks. However, a map projection distortion problem rises when applying location encoding models to large-scale real-world GPS coordinate datasets (e.g., species images taken all over the world) - all current location encoding models are designed for encoding points in a 2D (Euclidean) space but not on a spherical surface, e.g., earth surface. To solve this problem, we propose a multi-scale location encoding model called Sphere2V ec which directly encodes point coordinates on a spherical surface while avoiding the mapprojection distortion problem. We provide theoretical proof that the Sphere2Vec encoding preserves the spherical surface distance between any two points. We also developed a unified view of distance-reserving encoding on spheres based on the Double Fourier Sphere (DFS). We apply Sphere2V ec to the geo-aware image classification task. Our analysis shows that Sphere2V ec outperforms other 2D space location encoder models especially on the polar regions and data-sparse areas for image classification tasks because of its nature for spherical surface distance preservation.
    A Survey on Bias and Fairness in Machine Learning. (arXiv:1908.09635v3 [cs.LG] UPDATED)
    (2 min) With the widespread use of AI systems and applications in our everyday lives, it is important to take fairness issues into consideration while designing and engineering these types of systems. Such systems can be used in many sensitive environments to make important and life-changing decisions; thus, it is crucial to ensure that the decisions do not reflect discriminatory behavior toward certain groups or populations. We have recently seen work in machine learning, natural language processing, and deep learning that addresses such challenges in different subdomains. With the commercialization of these systems, researchers are becoming aware of the biases that these applications can contain and have attempted to address them. In this survey we investigated different real-world applications that have shown biases in various ways, and we listed different sources of biases that can affect AI applications. We then created a taxonomy for fairness definitions that machine learning researchers have defined in order to avoid the existing bias in AI systems. In addition to that, we examined different domains and subdomains in AI showing what researchers have observed with regard to unfair outcomes in the state-of-the-art methods and how they have tried to address them. There are still many future directions and solutions that can be taken to mitigate the problem of bias in AI systems. We are hoping that this survey will motivate researchers to tackle these issues in the near future by observing existing work in their respective fields.
    Unsupervised Belief Representation Learning in Polarized Networks: A Variational Graph Auto-Encoder Approach. (arXiv:2110.00210v4 [cs.SI] UPDATED)
    (2 min) In this paper, we propose an Information-Theoretic Variational Graph Auto-Encoder (InfoVGAE) for polarity representation learning in an unsupervised manner. It jointly learns the belief embedding of both users and their claims in the same latent space. In order to better disentangle the latent space, a total correlation regularizer, a PI controller, and the rectified Gaussian Distribution are adopted to constrain the generated distribution. Experimental results show that the proposed InfoVGAE outperforms the existing unsupervised polarity detection methods, and achieves a highly comparable result with supervised methods in terms of F1 score and purity.
    Mutual information neural estimation for unsupervised multi-modal registration of brain images. (arXiv:2201.10305v1 [eess.IV])
    (2 min) Many applications in image-guided surgery and therapy require fast and reliable non-linear, multi-modal image registration. Recently proposed unsupervised deep learning-based registration methods have demonstrated superior performance compared to iterative methods in just a fraction of the time. Most of the learning-based methods have focused on mono-modal image registration. The extension to multi-modal registration depends on the use of an appropriate similarity function, such as the mutual information (MI). We propose guiding the training of a deep learning-based registration method with MI estimation between an image-pair in an end-to-end trainable network. Our results show that a small, 2-layer network produces competitive results in both mono- and multimodal registration, with sub-second run-times. Comparisons to both iterative and deep learning-based methods show that our MI-based method produces topologically and qualitatively superior results with an extremely low rate of non-diffeomorphic transformations. Real-time clinical application will benefit from a better visual matching of anatomical structures and less registration failures/outliers.
    Optimal estimation of Gaussian DAG models. (arXiv:2201.10548v1 [math.ST])
    (2 min) We study the optimal sample complexity of learning a Gaussian directed acyclic graph (DAG) from observational data. Our main result establishes the minimax optimal sample complexity for learning the structure of a linear Gaussian DAG model with equal variances to be $n\asymp q\log(d/q)$, where $q$ is the maximum number of parents and $d$ is the number of nodes. We further make comparisons with the classical problem of learning (undirected) Gaussian graphical models, showing that under the equal variance assumption, these two problems share the same optimal sample complexity. In other words, at least for Gaussian models with equal error variances, learning a directed graphical model is not more difficult than learning an undirected graphical model. Our results also extend to more general identification assumptions as well as subgaussian errors.
    Robust Geodesic Regression. (arXiv:2007.04518v3 [stat.ML] UPDATED)
    (0 min) This paper studies robust regression for data on Riemannian manifolds. Geodesic regression is the generalization of linear regression to a setting with a manifold-valued dependent variable and one or more real-valued independent variables. The existing work on geodesic regression uses the sum-of-squared errors to find the solution, but as in the classical Euclidean case, the least-squares method is highly sensitive to outliers. In this paper, we use M-type estimators, including the $L_1$, Huber and Tukey biweight estimators, to perform robust geodesic regression, and describe how to calculate the tuning parameters for the latter two. We also show that, on compact symmetric spaces, all M-type estimators are maximum likelihood estimators, and argue for the overall superiority of the $L_1$ estimator over the $L_2$ and Huber estimators on high-dimensional manifolds and over the Tukey biweight estimator on compact high-dimensional manifolds. Results from numerical examples, including analysis of real neuroimaging data, demonstrate the promising empirical properties of the proposed approach.
    On Structured Filtering-Clustering: Global Error Bound and Optimal First-Order Algorithms. (arXiv:1904.07462v3 [stat.ML] UPDATED)
    (2 min) The filtering-clustering models, including trend filtering and convex clustering, have become an important source of ideas and modeling tools in machine learning and related fields. The statistical guarantee of optimal solutions in these models has been extensively studied yet the investigations on the computational aspect have remained limited. In particular, practitioners often employ the first-order algorithms in real-world applications and are impressed by their superior performance regardless of ill-conditioned structures of difference operator matrices, thus leaving open the problem of understanding the convergence property of first-order algorithms. This paper settles this open problem and contributes to the broad interplay between statistics and optimization by identifying a \textit{global error bound} condition, which is satisfied by a large class of dual filtering-clustering problems, and designing a class of \textit{generalized dual gradient ascent} algorithm, which is \textit{optimal} first-order algorithms in deterministic, finite-sum and online settings. Our results are new and help explain why the filtering-clustering models can be efficiently solved by first-order algorithms. We also provide the detailed convergence rate analysis for the proposed algorithms in different settings, shedding light on their potential to solve the filtering-clustering models efficiently. We also conduct experiments on real datasets and the numerical results demonstrate the effectiveness of our algorithms.
    Explanatory Learning: Beyond Empiricism in Neural Networks. (arXiv:2201.10222v1 [cs.LG])
    (2 min) We introduce Explanatory Learning (EL), a framework to let machines use existing knowledge buried in symbolic sequences -- e.g. explanations written in hieroglyphic -- by autonomously learning to interpret them. In EL, the burden of interpreting symbols is not left to humans or rigid human-coded compilers, as done in Program Synthesis. Rather, EL calls for a learned interpreter, built upon a limited collection of symbolic sequences paired with observations of several phenomena. This interpreter can be used to make predictions on a novel phenomenon given its explanation, and even to find that explanation using only a handful of observations, like human scientists do. We formulate the EL problem as a simple binary classification task, so that common end-to-end approaches aligned with the dominant empiricist view of machine learning could, in principle, solve it. To these models, we oppose Critical Rationalist Networks (CRNs), which instead embrace a rationalist view on the acquisition of knowledge. CRNs express several desired properties by construction, they are truly explainable, can adjust their processing at test-time for harder inferences, and can offer strong confidence guarantees on their predictions. As a final contribution, we introduce Odeen, a basic EL environment that simulates a small flatland-style universe full of phenomena to explain. Using Odeen as a testbed, we show how CRNs outperform empiricist end-to-end approaches of similar size and architecture (Transformers) in discovering explanations for novel phenomena.
    An adaptive closed-loop ECoG decoder for long-term and stable bimanual control of an exoskeleton by a tetraplegic. (arXiv:2201.10449v1 [eess.SP])
    (2 min) Brain-computer interfaces (BCIs) still face many challenges to step out of laboratories to be used in real-life applications. A key one persists in the high performance control of diverse effectors for complex tasks, using chronic and safe recorders. This control must be robust over time and of high decoding performance without continuous recalibration of the decoders. In the article, asynchronous control of an exoskeleton by a tetraplegic patient using a chronically implanted epidural electrocorticography (EpiCoG) implant is demonstrated. For this purpose, an adaptive online tensor-based decoder: the Recursive Exponentially Weighted Markov-Switching multi-Linear Model (REW-MSLM) was developed. We demonstrated over a period of 6 months the stability of the 8-dimensional alternative bimanual control of the exoskeleton and its virtual avatar using REW-MSLM without recalibration of the decoder.
    Addressing the Intra-class Mode Collapse Problem using Adaptive Input Image Normalization in GAN-based X-ray Images. (arXiv:2201.10324v1 [eess.IV])
    (2 min) Biomedical image datasets can be imbalanced due to the rarity of targeted diseases. Generative Adversarial Networks play a key role in addressing this imbalance by enabling the generation of synthetic images to augment and balance datasets. It is important to generate synthetic images that incorporate a diverse range of features such that they accurately represent the distribution of features present in the training imagery. Furthermore, the absence of diverse features in synthetic images can degrade the performance of machine learning classifiers. The mode collapse problem can impact a Generative Adversarial Network's capacity to generate diversified images. The mode collapse comes in two varieties; intra-class and inter-class. In this paper, the intra-class mode collapse problem is investigated, and its subsequent impact on the diversity of synthetic X-ray images is evaluated. This work contributes an empirical demonstration of the benefits of integrating the adaptive input-image normalization for the Deep Convolutional GAN to alleviate the intra-class mode collapse problem. Results demonstrate that the DCGAN with adaptive input-image normalization outperforms DCGAN with un-normalized X-ray images as evident by the superior diversity scores.
    Using Deep Reinforcement Learning for Zero Defect Smart Forging. (arXiv:2201.10268v1 [cs.LG])
    (0 min) Defects during production may lead to material waste, which is a significant challenge for many companies as it reduces revenue and negatively impacts sustainability and the environment. An essential reason for material waste is a low degree of automation, especially in industries that currently have a low degree of digitalization, such as steel forging. Those industries typically rely on heavy and old machinery such as large induction ovens that are mostly controlled manually or using well-known recipes created by experts. However, standard recipes may fail when unforeseen events happen, such as an unplanned stop in production, which may lead to overheating and thus material degradation during the forging process. In this paper, we develop a digital twin-based optimization strategy for the heating process for a forging line to automate the development of an optimal control policy that adjusts the power for the heating coils in an induction oven based on temperature data observed from pyrometers. We design a digital twin-based deep reinforcement learning (DTRL) framework and train two different deep reinforcement learning (DRL) models for the heating phase using a digital twin of the forging line. The twin is based on a simulator that contains a heating transfer and movement model, which is used as an environment for the DRL training. Our evaluation shows that both models significantly reduce the temperature unevenness and can help to automate the traditional heating process.
    A Multi-modal Fusion Framework Based on Multi-task Correlation Learning for Cancer Prognosis Prediction. (arXiv:2201.10353v1 [cs.LG])
    (2 min) Morphological attributes from histopathological images and molecular profiles from genomic data are important information to drive diagnosis, prognosis, and therapy of cancers. By integrating these heterogeneous but complementary data, many multi-modal methods are proposed to study the complex mechanisms of cancers, and most of them achieve comparable or better results from previous single-modal methods. However, these multi-modal methods are restricted to a single task (e.g., survival analysis or grade classification), and thus neglect the correlation between different tasks. In this study, we present a multi-modal fusion framework based on multi-task correlation learning (MultiCoFusion) for survival analysis and cancer grade classification, which combines the power of multiple modalities and multiple tasks. Specifically, a pre-trained ResNet-152 and a sparse graph convolutional network (SGCN) are used to learn the representations of histopathological images and mRNA expression data respectively. Then these representations are fused by a fully connected neural network (FCNN), which is also a multi-task shared network. Finally, the results of survival analysis and cancer grade classification output simultaneously. The framework is trained by an alternate scheme. We systematically evaluate our framework using glioma datasets from The Cancer Genome Atlas (TCGA). Results demonstrate that MultiCoFusion learns better representations than traditional feature extraction methods. With the help of multi-task alternating learning, even simple multi-modal concatenation can achieve better performance than other deep learning and traditional methods. Multi-task learning can improve the performance of multiple tasks not just one of them, and it is effective in both single-modal and multi-modal data.
    Multi-agent Performative Prediction: From Global Stability and Optimality to Chaos. (arXiv:2201.10483v1 [cs.LG])
    (0 min) The recent framework of performative prediction is aimed at capturing settings where predictions influence the target/outcome they want to predict. In this paper, we introduce a natural multi-agent version of this framework, where multiple decision makers try to predict the same outcome. We showcase that such competition can result in interesting phenomena by proving the possibility of phase transitions from stability to instability and eventually chaos. Specifically, we present settings of multi-agent performative prediction where under sufficient conditions their dynamics lead to global stability and optimality. In the opposite direction, when the agents are not sufficiently cautious in their learning/updates rates, we show that instability and in fact formal chaos is possible. We complement our theoretical predictions with simulations showcasing the predictive power of our results.
    PREVIS -- A Combined Machine Learning and Visual Interpolation Approach for Interactive Reverse Engineering in Assembly Quality Control. (arXiv:2201.10257v1 [cs.HC])
    (0 min) We present PREVIS, a visual analytics tool, enhancing machine learning performance analysis in engineering applications. The presented toolchain allows for a direct comparison of regression models. In addition, we provide a methodology to visualize the impact of regression errors on the underlying field of interest in the original domain, the part geometry, via exploiting standard interpolation methods. Further, we allow a real-time preview of user-driven parameter changes in the displacement field via visual interpolation. This allows for fast and accountable online change management. We demonstrate the effectiveness with an ex-ante optimization of an automotive engine hood.
    Design choice and machine learning model performances. (arXiv:2201.10239v1 [stat.ML])
    (2 min) An increasing number of publications present the joint application of Design of Experiments (DOE) and machine learning (ML) as a methodology to collect and analyze data on a specific industrial phenomenon. However, the literature shows that the choice of the design for data collection and model for data analysis is often driven by incidental factors, rather than by statistical or algorithmic advantages, thus there is a lack of studies which provide guidelines on what designs and ML models to jointly use for data collection and analysis. This is the first time in the literature that a paper discusses the choice of design in relation to the ML model performances. An extensive study is conducted that considers 12 experimental designs, 7 families of predictive models, 7 test functions that emulate physical processes, and 8 noise settings, both homoscedastic and heteroscedastic. The results of the research can have an immediate impact on the work of practitioners, providing guidelines for practical applications of DOE and ML.
    BERTHA: Video Captioning Evaluation Via Transfer-Learned Human Assessment. (arXiv:2201.10243v1 [cs.CV])
    (2 min) Evaluating video captioning systems is a challenging task as there are multiple factors to consider; for instance: the fluency of the caption, multiple actions happening in a single scene, and the human bias of what is considered important. Most metrics try to measure how similar the system generated captions are to a single or a set of human-annotated captions. This paper presents a new method based on a deep learning model to evaluate these systems. The model is based on BERT, which is a language model that has been shown to work well in multiple NLP tasks. The aim is for the model to learn to perform an evaluation similar to that of a human. To do so, we use a dataset that contains human evaluations of system generated captions. The dataset consists of the human judgments of the captions produce by the system participating in various years of the TRECVid video to text task. These annotations will be made publicly available. BERTHA obtain favourable results, outperforming the commonly used metrics in some setups.
    Almost Optimal Variance-Constrained Best Arm Identification. (arXiv:2201.10142v1 [cs.LG])
    (0 min) We design and analyze VA-LUCB, a parameter-free algorithm, for identifying the best arm under the fixed-confidence setup and under a stringent constraint that the variance of the chosen arm is strictly smaller than a given threshold. An upper bound on VA-LUCB's sample complexity is shown to be characterized by a fundamental variance-aware hardness quantity $H_{VA}$. By proving a lower bound, we show that sample complexity of VA-LUCB is optimal up to a factor logarithmic in $H_{VA}$. Extensive experiments corroborate the dependence of the sample complexity on the various terms in $H_{VA}$. By comparing VA-LUCB's empirical performance to a close competitor RiskAverse-UCB-BAI by David et al. (2018), our experiments suggest that VA-LUCB has the lowest sample complexity for this class of risk-constrained best arm identification problems, especially for the riskiest instances.
    A Full-Stack Search Technique for Domain Optimized Deep Learning Accelerators. (arXiv:2105.12842v2 [cs.LG] UPDATED)
    (2 min) The rapidly-changing deep learning landscape presents a unique opportunity for building inference accelerators optimized for specific datacenter-scale workloads. We propose Full-stack Accelerator Search Technique (FAST), a hardware accelerator search framework that defines a broad optimization environment covering key design decisions within the hardware-software stack, including hardware datapath, software scheduling, and compiler passes such as operation fusion and tensor padding. In this paper, we analyze bottlenecks in state-of-the-art vision and natural language processing (NLP) models, including EfficientNet and BERT, and use FAST to design accelerators capable of addressing these bottlenecks. FAST-generated accelerators optimized for single workloads improve Perf/TDP by 3.7x on average across all benchmarks compared to TPU-v3. A FAST-generated accelerator optimized for serving a suite of workloads improves Perf/TDP by 2.4x on average compared to TPU-v3. Our return on investment analysis shows that FAST-generated accelerators can potentially be practical for moderate-sized datacenter deployments.
    ML4CO-KIDA: Knowledge Inheritance in Data Aggregation. (arXiv:2201.10328v1 [cs.AI])
    (2 min) The Machine Learning for Combinatorial Optimization (ML4CO) NeurIPS 2021 competition aims to improve state-of-the-art combinatorial optimization solvers by replacing key heuristic components with machine learning models. On the dual task, we design models to make branching decisions to promote the dual bound increase faster. We propose a knowledge inheritance method to generalize knowledge of different models from the dataset aggregation process, named KIDA. Our improvement overcomes some defects of the baseline graph-neural-networks-based methods. Further, we won the $1$\textsuperscript{st} Place on the dual task. We hope this report can provide useful experience for developers and researchers. The code is available at \url{https://github.com/megvii-research/NeurIPS2021-ML4CO-KIDA}
    GIU-GANs: Global Information Utilization for Generative Adversarial Networks. (arXiv:2201.10471v1 [cs.LG])
    (2 min) In recent years, with the rapid development of artificial intelligence, image generation based on deep learning has dramatically advanced. Image generation based on Generative Adversarial Networks (GANs) is a promising study. However, since convolutions are limited by spatial-agnostic and channel-specific, features extracted by traditional GANs based on convolution are constrained. Therefore, GANs are unable to capture any more details per image. On the other hand, straightforwardly stacking of convolutions causes too many parameters and layers in GANs, which will lead to a high risk of overfitting. To overcome the aforementioned limitations, in this paper, we propose a new GANs called Involution Generative Adversarial Networks (GIU-GANs). GIU-GANs leverages a brand new module called the Global Information Utilization (GIU) module, which integrates Squeeze-and-Excitation Networks (SENet) and involution to focus on global information by channel attention mechanism, leading to a higher quality of generated images. Meanwhile, Batch Normalization(BN) inevitably ignores the representation differences among noise sampled by the generator, and thus degrade the generated image quality. Thus we introduce Representative Batch Normalization(RBN) to the GANs architecture for this issue. The CIFAR-10 and CelebA datasets are employed to demonstrate the effectiveness of our proposed model. A large number of experiments prove that our model achieves state-of-the-art competitive performance.
    Combining Commonsense Reasoning and Knowledge Acquisition to Guide Deep Learning in Robotics. (arXiv:2201.10266v1 [cs.AI])
    (2 min) Algorithms based on deep network models are being used for many pattern recognition and decision-making tasks in robotics and AI. Training these models requires a large labeled dataset and considerable computational resources, which are not readily available in many domains. Also, it is difficult to explore the internal representations and reasoning mechanisms of these models. As a step towards addressing the underlying knowledge representation, reasoning, and learning challenges, the architecture described in this paper draws inspiration from research in cognitive systems. As a motivating example, we consider an assistive robot trying to reduce clutter in any given scene by reasoning about the occlusion of objects and stability of object configurations in an image of the scene. In this context, our architecture incrementally learns and revises a grounding of the spatial relations between objects and uses this grounding to extract spatial information from input images. Non-monotonic logical reasoning with this information and incomplete commonsense domain knowledge is used to make decisions about stability and occlusion. For images that cannot be processed by such reasoning, regions relevant to the tasks at hand are automatically identified and used to train deep network models to make the desired decisions. Image regions used to train the deep networks are also used to incrementally acquire previously unknown state constraints that are merged with the existing knowledge for subsequent reasoning. Experimental evaluation performed using simulated and real-world images indicates that in comparison with baselines based just on deep networks, our architecture improves reliability of decision making and reduces the effort involved in training data-driven deep network models.
    Optimal High-order Tensor SVD via Tensor-Train Orthogonal Iteration. (arXiv:2010.02482v2 [math.ST] UPDATED)
    (2 min) This paper studies a general framework for high-order tensor SVD. We propose a new computationally efficient algorithm, tensor-train orthogonal iteration (TTOI), that aims to estimate the low tensor-train rank structure from the noisy high-order tensor observation. The proposed TTOI consists of initialization via TT-SVD (Oseledets, 2011) and new iterative backward/forward updates. We develop the general upper bound on estimation error for TTOI with the support of several new representation lemmas on tensor matricizations. By developing a matching information-theoretic lower bound, we also prove that TTOI achieves the minimax optimality under the spiked tensor model. The merits of the proposed TTOI are illustrated through applications to estimation and dimension reduction of high-order Markov processes, numerical studies, and a real data example on New York City taxi travel records. The software of the proposed algorithm is available online$^6$.
    Identifying a Training-Set Attack's Target Using Renormalized Influence Estimation. (arXiv:2201.10055v1 [cs.LG])
    (2 min) Targeted training-set attacks inject malicious instances into the training set to cause a trained model to mislabel one or more specific test instances. This work proposes the task of target identification, which determines whether a specific test instance is the target of a training-set attack. This can then be combined with adversarial-instance identification to find (and remove) the attack instances, mitigating the attack with minimal impact on other predictions. Rather than focusing on a single attack method or data modality, we build on influence estimation, which quantifies each training instance's contribution to a model's prediction. We show that existing influence estimators' poor practical performance often derives from their over-reliance on instances and iterations with large losses. Our renormalized influence estimators fix this weakness; they far outperform the original ones at identifying influential groups of training examples in both adversarial and non-adversarial settings, even finding up to 100% of adversarial training instances with no clean-data false positives. Target identification then simplifies to detecting test instances with anomalous influence values. We demonstrate our method's generality on backdoor and poisoning attacks across various data domains including text, vision, and speech. Our source code is available at https://github.com/ZaydH/target_identification .
    Zero-Shot Long-Form Voice Cloning with Dynamic Convolution Attention. (arXiv:2201.10375v1 [eess.AS])
    (2 min) With recent advancements in voice cloning, the performance of speech synthesis for a target speaker has been rendered similar to the human level. However, autoregressive voice cloning systems still suffer from text alignment failures, resulting in an inability to synthesize long sentences. In this work, we propose a variant of attention-based text-to-speech system that can reproduce a target voice from a few seconds of reference speech and generalize to very long utterances as well. The proposed system is based on three independently trained components: a speaker encoder, synthesizer and universal vocoder. Generalization to long utterances is realized using an energy-based attention mechanism known as Dynamic Convolution Attention, in combination with a set of modifications proposed for the synthesizer based on Tacotron 2. Moreover, effective zero-shot speaker adaptation is achieved by conditioning both the synthesizer and vocoder on a speaker encoder that has been pretrained on a large corpus of diverse data. We compare several implementations of voice cloning systems in terms of speech naturalness, speaker similarity, alignment consistency and ability to synthesize long utterances, and conclude that the proposed model can produce intelligible synthetic speech for extremely long utterances, while preserving a high extent of naturalness and similarity for short texts.
    On Uniform Boundedness Properties of SGD and its Momentum Variants. (arXiv:2201.10245v1 [cs.LG])
    (2 min) A theoretical, and potentially also practical, problem with stochastic gradient descent is that trajectories may escape to infinity. In this note, we investigate uniform boundedness properties of iterates and function values along the trajectories of the stochastic gradient descent algorithm and its important momentum variant. Under smoothness and $R$-dissipativity of the loss function, we show that broad families of step-sizes, including the widely used step-decay and cosine with (or without) restart step-sizes, result in uniformly bounded iterates and function values. Several important applications that satisfy these assumptions, including phase retrieval problems, Gaussian mixture models and some neural network classifiers, are discussed in detail.
    Improving Proximity Estimation for Contact Tracing using a Multi-channel Approach. (arXiv:2201.10401v1 [cs.NI])
    (2 min) Due to the COVID 19 pandemic, smartphone-based proximity tracing systems became of utmost interest. Many of these systems use Bluetooth Low Energy (BLE) signals to estimate the distance between two persons. The quality of this method depends on many factors and, therefore, does not always deliver accurate results. In this paper, we present a multi-channel approach to improve proximity estimation, and a novel, publicly available dataset that contains matched IEEE 802.11 (2.4 GHz and 5 GHz) and BLE signal strength data, measured in four different environments. We have developed and evaluated a combined classification model based on BLE and IEEE 802.11 signals. Our approach significantly improves the distance estimation and consequently also the contact tracing accuracy. We are able to achieve good results with our approach in everyday public transport scenarios. However, in our implementation based on IEEE 802.11 probe requests, we also encountered privacy problems and limitations due to the consistency and interval at which such probes are sent. We discuss these limitations and sketch how our approach could be improved to make it suitable for real-world deployment.
    Convergence of Invariant Graph Networks. (arXiv:2201.10129v1 [cs.LG])
    (2 min) Although theoretical properties such as expressive power and over-smoothing of graph neural networks (GNN) have been extensively studied recently, its convergence property is a relatively new direction. In this paper, we investigate the convergence of one powerful GNN, Invariant Graph Network (IGN) over graphs sampled from graphons. We first prove the stability of linear layers for general $k$-IGN (of order $k$) based on a novel interpretation of linear equivariant layers. Building upon this result, we prove the convergence of $k$-IGN under the model of \citet{ruiz2020graphon}, where we access the edge weight but the convergence error is measured for graphon inputs. Under the more natural (and more challenging) setting of \citet{keriven2020convergence} where one can only access 0-1 adjacency matrix sampled according to edge probability, we first show a negative result that the convergence of any IGN is not possible. We then obtain the convergence of a subset of IGNs, denoted as IGN-small, after the edge probability estimation. We show that IGN-small still contains function class rich enough that can approximate spectral GNNs arbitrarily well. Lastly, we perform experiments on various graphon models to verify our statements.
    Cold Start Active Learning Strategies in the Context of Imbalanced Classification. (arXiv:2201.10227v1 [cs.LG])
    (2 min) We present novel active learning strategies dedicated to providing a solution to the cold start stage, i.e. initializing the classification of a large set of data with no attached labels. Moreover, proposed strategies are designed to handle an imbalanced context in which random selection is highly inefficient. Specifically, our active learning iterations address label scarcity and imbalance using element scores, combining information extracted from a clustering structure to a label propagation model. The strategy is illustrated by a case study on annotating Twitter content w.r.t. testimonies of a real flood event. We show that our method effectively copes with class imbalance, by boosting the recall of samples from the minority class.
    Tracking and Planning with Spatial World Models. (arXiv:2201.10335v1 [cs.LG])
    (2 min) We introduce a method for real-time navigation and tracking with differentiably rendered world models. Learning models for control has led to impressive results in robotics and computer games, but this success has yet to be extended to vision-based navigation. To address this, we transfer advances in the emergent field of differentiable rendering to model-based control. We do this by planning in a learned 3D spatial world model, combined with a pose estimation algorithm previously used in the context of TSDF fusion, but now tailored to our setting and improved to incorporate agent dynamics. We evaluate over six simulated environments based on complex human-designed floor plans and provide quantitative results. We achieve up to 92% navigation success rate at a frequency of 15 Hz using only image and depth observations under stochastic, continuous dynamics.
    Neural Architecture Search for Spiking Neural Networks. (arXiv:2201.10355v1 [cs.NE])
    (2 min) Spiking Neural Networks (SNNs) have gained huge attention as a potential energy-efficient alternative to conventional Artificial Neural Networks (ANNs) due to their inherent high-sparsity activation. However, most prior SNN methods use ANN-like architectures (e.g., VGG-Net or ResNet), which could provide sub-optimal performance for temporal sequence processing of binary information in SNNs. To address this, in this paper, we introduce a novel Neural Architecture Search (NAS) approach for finding better SNN architectures. Inspired by recent NAS approaches that find the optimal architecture from activation patterns at initialization, we select the architecture that can represent diverse spike activation patterns across different data samples without training. Furthermore, to leverage the temporal correlation among the spikes, we search for feed forward connections as well as backward connections (i.e., temporal feedback connections) between layers. Interestingly, SNASNet found by our search algorithm achieves higher performance with backward connections, demonstrating the importance of designing SNN architecture for suitably using temporal information. We conduct extensive experiments on three image recognition benchmarks where we show that SNASNet achieves state-of-the-art performance with significantly lower timesteps (5 timesteps).
    Zero-Shot Sketch Based Image Retrieval using Graph Transformer. (arXiv:2201.10185v1 [cs.CV])
    (2 min) The performance of a zero-shot sketch-based image retrieval (ZS-SBIR) task is primarily affected by two challenges. The substantial domain gap between image and sketch features needs to be bridged, while at the same time the side information has to be chosen tactfully. Existing literature has shown that varying the semantic side information greatly affects the performance of ZS-SBIR. To this end, we propose a novel graph transformer based zero-shot sketch-based image retrieval (GTZSR) framework for solving ZS-SBIR tasks which uses a novel graph transformer to preserve the topology of the classes in the semantic space and propagates the context-graph of the classes within the embedding features of the visual space. To bridge the domain gap between the visual features, we propose minimizing the Wasserstein distance between images and sketches in a learned domain-shared space. We also propose a novel compatibility loss that further aligns the two visual domains by bridging the domain gap of one class with respect to the domain gap of all other classes in the training set. Experimental results obtained on the extended Sketchy, TU-Berlin, and QuickDraw datasets exhibit sharp improvements over the existing state-of-the-art methods in both ZS-SBIR and generalized ZS-SBIR.
    Convolutional Xformers for Vision. (arXiv:2201.10271v1 [cs.CV])
    (2 min) Vision transformers (ViTs) have found only limited practical use in processing images, in spite of their state-of-the-art accuracy on certain benchmarks. The reason for their limited use include their need for larger training datasets and more computational resources compared to convolutional neural networks (CNNs), owing to the quadratic complexity of their self-attention mechanism. We propose a linear attention-convolution hybrid architecture -- Convolutional X-formers for Vision (CXV) -- to overcome these limitations. We replace the quadratic attention with linear attention mechanisms, such as Performer, Nystr\"omformer, and Linear Transformer, to reduce its GPU usage. Inductive prior for image data is provided by convolutional sub-layers, thereby eliminating the need for class token and positional embeddings used by the ViTs. We also propose a new training method where we use two different optimizers during different phases of training and show that it improves the top-1 image classification accuracy across different architectures. CXV outperforms other architectures, token mixers (e.g. ConvMixer, FNet and MLP Mixer), transformer models (e.g. ViT, CCT, CvT and hybrid Xformers), and ResNets for image classification in scenarios with limited data and GPU resources (cores, RAM, power).
    Little Help Makes a Big Difference: Leveraging Active Learning to Improve Unsupervised Time Series Anomaly Detection. (arXiv:2201.10323v1 [cs.LG])
    (2 min) Key Performance Indicators (KPI), which are essentially time series data, have been widely used to indicate the performance of telecom networks. Based on the given KPIs, a large set of anomaly detection algorithms have been deployed for detecting the unexpected network incidents. Generally, unsupervised anomaly detection algorithms gain more popularity than the supervised ones, due to the fact that labeling KPIs is extremely time- and resource-consuming, and error-prone. However, those unsupervised anomaly detection algorithms often suffer from excessive false alarms, especially in the presence of concept drifts resulting from network re-configurations or maintenance. To tackle this challenge and improve the overall performance of unsupervised anomaly detection algorithms, we propose to use active learning to introduce and benefit from the feedback of operators, who can verify the alarms (both false and true ones) and label the corresponding KPIs with reasonable effort. Specifically, we develop three query strategies to select the most informative and representative samples to label. We also develop an efficient method to update the weights of Isolation Forest and optimally adjust the decision threshold, so as to eventually improve the performance of detection model. The experiments with one public dataset and one proprietary dataset demonstrate that our active learning empowered anomaly detection pipeline could achieve performance gain, in terms of F1-score, more than 50% over the baseline algorithm. It also outperforms the existing active learning based methods by approximately 6%-10%, with significantly reduced budget (the ratio of samples to be labeled).
    Dissipative Hamiltonian Neural Networks: Learning Dissipative and Conservative Dynamics Separately. (arXiv:2201.10085v1 [cs.LG])
    (2 min) Understanding natural symmetries is key to making sense of our complex and ever-changing world. Recent work has shown that neural networks can learn such symmetries directly from data using Hamiltonian Neural Networks (HNNs). But HNNs struggle when trained on datasets where energy is not conserved. In this paper, we ask whether it is possible to identify and decompose conservative and dissipative dynamics simultaneously. We propose Dissipative Hamiltonian Neural Networks (D-HNNs), which parameterize both a Hamiltonian and a Rayleigh dissipation function. Taken together, they represent an implicit Helmholtz decomposition which can separate dissipative effects such as friction from symmetries such as conservation of energy. We train our model to decompose a damped mass-spring system into its friction and inertial terms and then show that this decomposition can be used to predict dynamics for unseen friction coefficients. Then we apply our model to real world data including a large, noisy ocean current dataset where decomposing the velocity field yields useful scientific insights.
    SPIRAL: Self-supervised Perturbation-Invariant Representation Learning for Speech Pre-Training. (arXiv:2201.10207v1 [eess.AS])
    (2 min) We introduce a new approach for speech pre-training named SPIRAL which works by learning denoising representation of perturbed data in a teacher-student framework. Specifically, given a speech utterance, we first feed the utterance to a teacher network to obtain corresponding representation. Then the same utterance is perturbed and fed to a student network. The student network is trained to output representation resembling that of the teacher. At the same time, the teacher network is updated as moving average of student's weights over training steps. In order to prevent representation collapse, we apply an in-utterance contrastive loss as pre-training objective and impose position randomization on the input to the teacher. SPIRAL achieves competitive or better results compared to state-of-the-art speech pre-training method wav2vec 2.0, with significant reduction of training cost (80% for Base model, 65% for Large model). Furthermore, we address the problem of noise-robustness that is critical to real-world speech applications. We propose multi-condition pre-training by perturbing the student's input with various types of additive noise. We demonstrate that multi-condition pre-trained SPIRAL models are more robust to noisy speech (9.0% - 13.3% relative word error rate reduction on real noisy test data), compared to applying multi-condition training solely in the fine-tuning stage. The code will be released after publication.
    MOORe: Model-based Offline-to-Online Reinforcement Learning. (arXiv:2201.10070v1 [cs.LG])
    (2 min) With the success of offline reinforcement learning (RL), offline trained RL policies have the potential to be further improved when deployed online. A smooth transfer of the policy matters in safe real-world deployment. Besides, fast adaptation of the policy plays a vital role in practical online performance improvement. To tackle these challenges, we propose a simple yet efficient algorithm, Model-based Offline-to-Online Reinforcement learning (MOORe), which employs a prioritized sampling scheme that can dynamically adjust the offline and online data for smooth and efficient online adaptation of the policy. We provide a theoretical foundation for our algorithms design. Experiment results on the D4RL benchmark show that our algorithm smoothly transfers from offline to online stages while enabling sample-efficient online adaption, and also significantly outperforms existing methods.
    Analytically Integratable Zero-restlength Springs for Capturing Dynamic Modes unrepresented by Quasistatic Neural Networks. (arXiv:2201.10122v1 [cs.LG])
    (2 min) We present a novel paradigm for modeling certain types of dynamic simulation in real-time with the aid of neural networks. In order to significantly reduce the requirements on data (especially time-dependent data), as well as decrease generalization error, our approach utilizes a data-driven neural network only to capture quasistatic information (instead of dynamic or time-dependent information). Subsequently, we augment our quasistatic neural network (QNN) inference with a (real-time) dynamic simulation layer. Our key insight is that the dynamic modes lost when using a QNN approximation can be captured with a quite simple (and decoupled) zero-restlength spring model, which can be integrated analytically (as opposed to numerically) and thus has no time-step stability restrictions. Additionally, we demonstrate that the spring constitutive parameters can be robustly learned from a surprisingly small amount of dynamic simulation data. Although we illustrate the efficacy of our approach by considering soft-tissue dynamics on animated human bodies, the paradigm is extensible to many different simulation frameworks.
    Bilevel Online Deep Learning in Non-stationary Environment. (arXiv:2201.10236v1 [cs.LG])
    (2 min) Recent years have witnessed enormous progress of online learning. However, a major challenge on the road to artificial agents is concept drift, that is, the data probability distribution would change where the data instance arrives sequentially in a stream fashion, which would lead to catastrophic forgetting and degrade the performance of the model. In this paper, we proposed a new Bilevel Online Deep Learning (BODL) framework, which combine bilevel optimization strategy and online ensemble classifier. In BODL algorithm, we use an ensemble classifier, which use the output of different hidden layers in deep neural network to build multiple base classifiers, the important weights of the base classifiers are updated according to exponential gradient descent method in an online manner. Besides, we apply the similar constraint to overcome the convergence problem of online ensemble framework. Then an effective concept drift detection mechanism utilizing the error rate of classifier is designed to monitor the change of the data probability distribution. When the concept drift is detected, our BODL algorithm can adaptively update the model parameters via bilevel optimization and then circumvent the large drift and encourage positive transfer. Finally, the extensive experiments and ablation studies are conducted on various datasets and the competitive numerical results illustrate that our BODL algorithm is a promising approach.
    PowerGear: Early-Stage Power Estimation in FPGA HLS via Heterogeneous Edge-Centric GNNs. (arXiv:2201.10114v1 [cs.LG])
    (2 min) Power estimation is the basis of many hardware optimization strategies. However, it is still challenging to offer accurate power estimation at an early stage such as high-level synthesis (HLS). In this paper, we propose PowerGear, a graph-learning-assisted power estimation approach for FPGA HLS, which features high accuracy, efficiency and transferability. PowerGear comprises two main components: a graph construction flow and a customized graph neural network (GNN) model. Specifically, in the graph construction flow, we introduce buffer insertion, datapath merging, graph trimming and feature annotation techniques to transform HLS designs into graph-structured data, which encode both intra-operation micro-architectures and inter-operation interconnects annotated with switching activities. Furthermore, we propose a novel power-aware heterogeneous edge-centric GNN model which effectively learns heterogeneous edge semantics and structural properties of the constructed graphs via edge-centric neighborhood aggregation, and fits the formulation of dynamic power. Compared with on-board measurement, PowerGear estimates total and dynamic power for new HLS designs with errors of 3.60% and 8.81%, respectively, which outperforms the prior arts in research and the commercial product Vivado. In addition, PowerGear demonstrates a speedup of 4x over Vivado power estimator. Finally, we present a case study in which PowerGear is exploited to facilitate design space exploration for FPGA HLS, leading to a performance gain of up to 11.2%, compared with methods using state-of-the-art predictive models.
    Towards Cross-Disaster Building Damage Assessment with Graph Convolutional Networks. (arXiv:2201.10395v1 [cs.CV])
    (2 min) In the aftermath of disasters, building damage maps are obtained using change detection to plan rescue operations. Current convolutional neural network approaches do not consider the similarities between neighboring buildings for predicting the damage. We present a novel graph-based building damage detection solution to capture these relationships. Our proposed model architecture learns from both local and neighborhood features to predict building damage. Specifically, we adopt the sample and aggregate graph convolution strategy to learn aggregation functions that generalize to unseen graphs which is essential for alleviating the time needed to obtain predictions for new disasters. Our experiments on the xBD dataset and comparisons with a classical convolutional neural network reveal that while our approach is handicapped by class imbalance, it presents a promising and distinct advantage when it comes to cross-disaster generalization.
    Investigating the impact of free energy based behavior on human in human-agent interaction. (arXiv:2201.10164v1 [cs.HC])
    (2 min) Humans communicate non-verbally by sharing physical rhythms, such as nodding and gestures, to involve each other. This sharing of physicality creates a sense of unity and makes humans feel involved with others. In this paper, we developed a new body motion generation system based on the free-energy principle (FEP), which not only responds passively but also prompts human actions. The proposed system consists of two modules, the sampling module, and the motion selection module. We conducted a subjective experiment to evaluate the "feeling of interacting with the agent" of the FEP based behavior. The results suggested that FEP based behaviors show more "feeling of interacting with the agent". Furthermore, we confirmed that the agent's gestures elicited subject gestures. This result not only reinforces the impression of feeling interaction but could also realization of agents that encourage people to change their behavior.
    Text and Code Embeddings by Contrastive Pre-Training. (arXiv:2201.10005v1 [cs.CL])
    (2 min) Text embeddings are useful features in many applications such as semantic search and computing text similarity. Previous work typically trains models customized for different use cases, varying in dataset choice, training objective and model architecture. In this work, we show that contrastive pre-training on unsupervised data at scale leads to high quality vector representations of text and code. The same unsupervised text embeddings that achieve new state-of-the-art results in linear-probe classification also display impressive semantic search capabilities and sometimes even perform competitively with fine-tuned models. On linear-probe classification accuracy averaging over 7 tasks, our best unsupervised model achieves a relative improvement of 4% and 1.8% over previous best unsupervised and supervised text embedding models respectively. The same text embeddings when evaluated on large-scale semantic search attains a relative improvement of 23.4%, 14.7%, and 10.6% over previous best unsupervised methods on MSMARCO, Natural Questions and TriviaQA benchmarks, respectively. Similarly to text embeddings, we train code embedding models on (text, code) pairs, obtaining a 20.8% relative improvement over prior best work on code search.
    ViT-HGR: Vision Transformer-based Hand Gesture Recognition from High Density Surface EMG Signals. (arXiv:2201.10060v1 [cs.CV])
    (2 min) Recently, there has been a surge of significant interest on application of Deep Learning (DL) models to autonomously perform hand gesture recognition using surface Electromyogram (sEMG) signals. DL models are, however, mainly designed to be applied on sparse sEMG signals. Furthermore, due to their complex structure, typically, we are faced with memory constraints; require large training times and a large number of training samples, and; there is the need to resort to data augmentation and/or transfer learning. In this paper, for the first time (to the best of our knowledge), we investigate and design a Vision Transformer (ViT) based architecture to perform hand gesture recognition from High Density (HD-sEMG) signals. Intuitively speaking, we capitalize on the recent breakthrough role of the transformer architecture in tackling different complex problems together with its potential for employing more input parallelization via its attention mechanism. The proposed Vision Transformer-based Hand Gesture Recognition (ViT-HGR) framework can overcome the aforementioned training time problems and can accurately classify a large number of hand gestures from scratch without any need for data augmentation and/or transfer learning. The efficiency of the proposed ViT-HGR framework is evaluated using a recently-released HD-sEMG dataset consisting of 65 isometric hand gestures. Our experiments with 64-sample (31.25 ms) window size yield average test accuracy of 84.62 +/- 3.07%, where only 78, 210 number of parameters is utilized. The compact structure of the proposed ViT-based ViT-HGR framework (i.e., having significantly reduced number of trainable parameters) shows great potentials for its practical application for prosthetic control.
    Are Commercial Face Detection Models as Biased as Academic Models?. (arXiv:2201.10047v1 [cs.CV])
    (2 min) As facial recognition systems are deployed more widely, scholars and activists have studied their biases and harms. Audits are commonly used to accomplish this and compare the algorithmic facial recognition systems' performance against datasets with various metadata labels about the subjects of the images. Seminal works have found discrepancies in performance by gender expression, age, perceived race, skin type, etc. These studies and audits often examine algorithms which fall into two categories: academic models or commercial models. We present a detailed comparison between academic and commercial face detection systems, specifically examining robustness to noise. We find that state-of-the-art academic face detection models exhibit demographic disparities in their noise robustness, specifically by having statistically significant decreased performance on older individuals and those who present their gender in a masculine manner. When we compare the size of these disparities to that of commercial models, we conclude that commercial models - in contrast to their relatively larger development budget and industry-level fairness commitments - are always as biased or more biased than an academic model.
    Reinforcement Learning Your Way: Agent Characterization through Policy Regularization. (arXiv:2201.10003v1 [cs.LG])
    (2 min) The increased complexity of state-of-the-art reinforcement learning (RL) algorithms have resulted in an opacity that inhibits explainability and understanding. This has led to the development of several post-hoc explainability methods that aim to extract information from learned policies thus aiding explainability. These methods rely on empirical observations of the policy and thus aim to generalize a characterization of agents' behaviour. In this study, we have instead developed a method to imbue a characteristic behaviour into agents' policies through regularization of their objective functions. Our method guides the agents' behaviour during learning which results in an intrinsic characterization; it connects the learning process with model explanation. We provide a formal argument and empirical evidence for the viability of our method. In future work, we intend to employ it to develop agents that optimize individual financial customers' investment portfolios based on their spending personalities.
    Variational Autoencoders for Reliability Optimization in Multi-Access Edge Computing Networks. (arXiv:2201.10032v1 [eess.SY])
    (2 min) Multi-access edge computing (MEC) is viewed as an integral part of future wireless networks to support new applications with stringent service reliability and latency requirements. However, guaranteeing ultra-reliable and low-latency MEC (URLL MEC) is very challenging due to uncertainties of wireless links, limited communications and computing resources, as well as dynamic network traffic. Enabling URLL MEC mandates taking into account the statistics of the end-to-end (E2E) latency and reliability across the wireless and edge computing systems. In this paper, a novel framework is proposed to optimize the reliability of MEC networks by considering the distribution of E2E service delay, encompassing over-the-air transmission and edge computing latency. The proposed framework builds on correlated variational autoencoders (VAEs) to estimate the full distribution of the E2E service delay. Using this result, a new optimization problem based on risk theory is formulated to maximize the network reliability by minimizing the Conditional Value at Risk (CVaR) as a risk measure of the E2E service delay. To solve this problem, a new algorithm is developed to efficiently allocate users' processing tasks to edge computing servers across the MEC network, while considering the statistics of the E2E service delay learned by VAEs. The simulation results show that the proposed scheme outperforms several baselines that do not account for the risk analyses or statistics of the E2E service delay.
    Attention-Based Vandalism Detection in OpenStreetMap. (arXiv:2201.10406v1 [cs.LG])
    (2 min) OpenStreetMap (OSM), a collaborative, crowdsourced Web map, is a unique source of openly available worldwide map data, increasingly adopted in Web applications. Vandalism detection is a critical task to support trust and maintain OSM transparency. This task is remarkably challenging due to the large scale of the dataset, the sheer number of contributors, various vandalism forms, and the lack of annotated data. This paper presents Ovid - a novel attention-based method for vandalism detection in OSM. Ovid relies on a novel neural architecture that adopts a multi-head attention mechanism to summarize information indicating vandalism from OSM changesets effectively. To facilitate automated vandalism detection, we introduce a set of original features that capture changeset, user, and edit information. Furthermore, we extract a dataset of real-world vandalism incidents from the OSM edit history for the first time and provide this dataset as open data. Our evaluation conducted on real-world vandalism data demonstrates the effectiveness of Ovid.
    Learning Resource Allocation Policies from Observational Data with an Application to Homeless Services Delivery. (arXiv:2201.10053v1 [cs.LG])
    (2 min) We study the problem of learning, from observational data, fair and interpretable policies that effectively match heterogeneous individuals to scarce resources of different types. We model this problem as a multi-class multi-server queuing system where both individuals and resources arrive stochastically over time. Each individual, upon arrival, is assigned to a queue where they wait to be matched to a resource. The resources are assigned in a first come first served (FCFS) fashion according to an eligibility structure that encodes the resource types that serve each queue. We propose a methodology based on techniques in modern causal inference to construct the individual queues as well as learn the matching outcomes and provide a mixed-integer optimization (MIO) formulation to optimize the eligibility structure. The MIO problem maximizes policy outcome subject to wait time and fairness constraints. It is very flexible, allowing for additional linear domain constraints. We conduct extensive analyses using synthetic and real-world data. In particular, we evaluate our framework using data from the U.S. Homeless Management Information System (HMIS). We obtain wait times as low as an FCFS policy while improving the rate of exit from homelessness for underserved or vulnerable groups (7% higher for the Black individuals and 15% higher for those below 17 years old) and overall.
    Neural Manifold Clustering and Embedding. (arXiv:2201.10000v1 [cs.LG])
    (2 min) Given a union of non-linear manifolds, non-linear subspace clustering or manifold clustering aims to cluster data points based on manifold structures and also learn to parameterize each manifold as a linear subspace in a feature space. Deep neural networks have the potential to achieve this goal under highly non-linear settings given their large capacity and flexibility. We argue that achieving manifold clustering with neural networks requires two essential ingredients: a domain-specific constraint that ensures the identification of the manifolds, and a learning algorithm for embedding each manifold to a linear subspace in the feature space. This work shows that many constraints can be implemented by data augmentation. For subspace feature learning, Maximum Coding Rate Reduction (MCR$^2$) objective can be used. Putting them together yields {\em Neural Manifold Clustering and Embedding} (NMCE), a novel method for general purpose manifold clustering, which significantly outperforms autoencoder-based deep subspace clustering. Further, on more challenging natural image datasets, NMCE can also outperform other algorithms specifically designed for clustering. Qualitatively, we demonstrate that NMCE learns a meaningful and interpretable feature space. As the formulation of NMCE is closely related to several important Self-supervised learning (SSL) methods, we believe this work can help us build a deeper understanding on SSL representation learning.
    Box Embeddings for the Description Logic EL++. (arXiv:2201.09919v1 [cs.AI])
    (2 min) Recently, various methods for representation learning on Knowledge Bases (KBs) have been developed. However, these approaches either only focus on learning the embeddings of the data-level knowledge (ABox) or exhibit inherent limitations when dealing with the concept-level knowledge (TBox), e.g., not properly modelling the structure of the logical knowledge. We present BoxEL, a geometric KB embedding approach that allows for better capturing logical structure expressed in the theories of Description Logic EL++. BoxEL models concepts in a KB as axis-parallel boxes exhibiting the advantage of intersectional closure, entities as points inside boxes, and relations between concepts/entities as affine transformations. We show theoretical guarantees (soundness) of BoxEL for preserving logical structure. Namely, the trained model of BoxEL embedding with loss 0 is a (logical) model of the KB. Experimental results on subsumption reasoning and a real-world application--protein-protein prediction show that BoxEL outperforms traditional knowledge graph embedding methods as well as state-of-the-art EL++ embedding approaches.
    RecShard: Statistical Feature-Based Memory Optimization for Industry-Scale Neural Recommendation. (arXiv:2201.10095v1 [cs.LG])
    (2 min) We propose RecShard, a fine-grained embedding table (EMB) partitioning and placement technique for deep learning recommendation models (DLRMs). RecShard is designed based on two key observations. First, not all EMBs are equal, nor all rows within an EMB are equal in terms of access patterns. EMBs exhibit distinct memory characteristics, providing performance optimization opportunities for intelligent EMB partitioning and placement across a tiered memory hierarchy. Second, in modern DLRMs, EMBs function as hash tables. As a result, EMBs display interesting phenomena, such as the birthday paradox, leaving EMBs severely under-utilized. RecShard determines an optimal EMB sharding strategy for a set of EMBs based on training data distributions and model characteristics, along with the bandwidth characteristics of the underlying tiered memory hierarchy. In doing so, RecShard achieves over 6 times higher EMB training throughput on average for capacity constrained DLRMs. The throughput increase comes from improved EMB load balance by over 12 times and from the reduced access to the slower memory by over 87 times.
    Guided Generative Protein Design using Regularized Transformers. (arXiv:2201.09948v1 [cs.LG])
    (2 min) The development of powerful natural language models have increased the ability to learn meaningful representations of protein sequences. In addition, advances in high-throughput mutagenesis, directed evolution, and next-generation sequencing have allowed for the accumulation of large amounts of labeled fitness data. Leveraging these two trends, we introduce Regularized Latent Space Optimization (ReLSO), a deep transformer-based autoencoder which is trained to jointly generate sequences as well as predict fitness. Using ReLSO, we explicitly model the underlying sequence-function landscape of large labeled datasets and optimize within latent space using gradient-based methods. Through regularized prediction heads, ReLSO introduces a powerful protein sequence encoder and novel approach for efficient fitness landscape traversal.
    FRAMED: Data-Driven Structural Performance Analysis of Community-Designed Bicycle Frames. (arXiv:2201.10459v1 [cs.LG])
    (2 min) This paper presents a data-driven analysis of the structural performance of 4500 community-designed bicycle frames. We present FRAMED -- a parametric dataset of bicycle frames based on bicycles designed by bicycle practitioners from across the world. To support our data-driven approach, we also provide a dataset of structural performance values such as weight, displacements under load, and safety factors for all the bicycle frame designs. By exploring a diverse design space of frame design parameters and a set of ten competing design objectives, we present an automated way to analyze the structural performance of bicycle frames. Our structural simulations are validated against physical experimentation on bicycle frames. Through our analysis, we highlight overall trends in bicycle frame designs created by community members, study several bicycle frames under different loading conditions, identify non-dominated design candidates that perform well on multiple objectives, and explore correlations between structural objectives. Our analysis shows that over 75\% of bicycle frames created by community members are infeasible, motivating the need for AI agents to support humans in designing bicycles. This work aims to simultaneously serve researchers focusing on bicycle design as well as researchers focusing on the development of data-driven design algorithms, such as surrogate models and Deep Generative Methods. The dataset and code are provided at this http URL
    COVID-19 Detection Using CT Image Based On YOLOv5 Network. (arXiv:2201.09972v1 [eess.IV])
    (2 min) Computer aided diagnosis (CAD) increases diagnosis efficiency, helping doctors providing a quick and confident diagnosis, it has played an important role in the treatment of COVID19. In our task, we solve the problem about abnormality detection and classification. The dataset provided by Kaggle platform and we choose YOLOv5 as our model. We introduce some methods on objective detection in the related work section, the objection detection can be divided into two streams: onestage and two stage. The representational model are Faster RCNN and YOLO series. Then we describe the YOLOv5 model in the detail. Compared Experiments and results are shown in section IV. We choose mean average precision (mAP) as our experiments' metrics, and the higher (mean) mAP is, the better result the model will gain. mAP@0.5 of our YOLOv5s is 0.623 which is 0.157 and 0.101 higher than Faster RCNN and EfficientDet respectively.
    BLDNet: A Semi-supervised Change Detection Building Damage Framework using Graph Convolutional Networks and Urban Domain Knowledge. (arXiv:2201.10389v1 [cs.CV])
    (2 min) Change detection is instrumental to localize damage and understand destruction in disaster informatics. While convolutional neural networks are at the core of recent change detection solutions, we present in this work, BLDNet, a novel graph formulation for building damage change detection and enable learning relationships and representations from both local patterns and non-stationary neighborhoods. More specifically, we use graph convolutional networks to efficiently learn these features in a semi-supervised framework with few annotated data. Additionally, BLDNet formulation allows for the injection of additional contextual building meta-features. We train and benchmark on the xBD dataset to validate the effectiveness of our approach. We also demonstrate on urban data from the 2020 Beirut Port Explosion that performance is improved by incorporating domain knowledge building meta-features.
    A Deep Learning Approach for the Detection of COVID-19 from Chest X-Ray Images using Convolutional Neural Networks. (arXiv:2201.09952v1 [eess.IV])
    (2 min) The COVID-19 (coronavirus) is an ongoing pandemic caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). The virus was first identified in mid-December 2019 in the Hubei province of Wuhan, China and by now has spread throughout the planet with more than 75.5 million confirmed cases and more than 1.67 million deaths. With limited number of COVID-19 test kits available in medical facilities, it is important to develop and implement an automatic detection system as an alternative diagnosis option for COVID-19 detection that can used on a commercial scale. Chest X-ray is the first imaging technique that plays an important role in the diagnosis of COVID-19 disease. Computer vision and deep learning techniques can help in determining COVID-19 virus with Chest X-ray Images. Due to the high availability of large-scale annotated image datasets, great success has been achieved using convolutional neural network for image analysis and classification. In this research, we have proposed a deep convolutional neural network trained on five open access datasets with binary output: Normal and Covid. The performance of the model is compared with four pre-trained convolutional neural network-based models (COVID-Net, ResNet18, ResNet and MobileNet-V2) and it has been seen that the proposed model provides better accuracy on the validation set as compared to the other four pre-trained models. This research work provides promising results which can be further improvise and implement on a commercial scale.
    Bayesian Inference with Nonlinear Generative Models: Comments on Secure Learning. (arXiv:2201.09986v1 [cs.IT])
    (2 min) Unlike the classical linear model, nonlinear generative models have been addressed sparsely in the literature. This work aims to bring attention to these models and their secrecy potential. To this end, we invoke the replica method to derive the asymptotic normalized cross entropy in an inverse probability problem whose generative model is described by a Gaussian random field with a generic covariance function. Our derivations further demonstrate the asymptotic statistical decoupling of Bayesian inference algorithms and specify the decoupled setting for a given nonlinear model. The replica solution depicts that strictly nonlinear models establish an all-or-nothing phase transition: There exists a critical load at which the optimal Bayesian inference changes from being perfect to an uncorrelated learning. This finding leads to design of a new secure coding scheme which achieves the secrecy capacity of the wiretap channel. The proposed coding has a significantly smaller codebook size compared to the random coding scheme of Wyner. This interesting result implies that strictly nonlinear generative models are perfectly secured without any secure coding. We justify this latter statement through the analysis of an illustrative model for perfectly secure and reliable inference.
    Decentralized EM to Learn Gaussian Mixtures from Datasets Distributed by Features. (arXiv:2201.09965v1 [cs.LG])
    (2 min) Expectation Maximization (EM) is the standard method to learn Gaussian mixtures. Yet its classic, centralized form is often infeasible, due to privacy concerns and computational and communication bottlenecks. Prior work dealt with data distributed by examples, horizontal partitioning, but we lack a counterpart for data scattered by features, an increasingly common scheme (e.g. user profiling with data from multiple entities). To fill this gap, we provide an EM-based algorithm to fit Gaussian mixtures to Vertically Partitioned data (VP-EM). In federated learning setups, our algorithm matches the centralized EM fitting of Gaussian mixtures constrained to a subspace. In arbitrary communication graphs, consensus averaging allows VP-EM to run on large peer-to-peer networks as an EM approximation. This mismatch comes from consensus error only, which vanishes exponentially fast with the number of consensus rounds. We demonstrate VP-EM on various topologies for both synthetic and real data, evaluating its approximation of centralized EM and seeing that it outperforms the available benchmark.
    Hydra: A System for Large Multi-Model Deep Learning. (arXiv:2110.08633v3 [cs.DC] UPDATED)
    (2 min) Training deep learning (DL) models that do not fit into the memory of a single GPU is a vexed process, forcing users to procure multiple GPUs to adopt model-parallel execution. Unfortunately, sequential dependencies in neural architectures often block efficient multi-device training, leading to suboptimal performance. We present 'model spilling', a technique aimed at models such as Transformers and CNNs to move groups of layers, or shards, between DRAM and GPU memory, thus enabling arbitrarily large models to be trained even on just one GPU. We then present a set of novel techniques leveraging spilling to raise efficiency for multi-model training workloads such as model selection: a new hybrid of task- and model-parallelism, a new shard scheduling heuristic, and 'double buffering' to hide latency. We prototype our ideas into a system we call HYDRA to support seamless single-model and multi-model training of large DL models. Experiments with real benchmark workloads show that HYDRA is over 7x faster than regular model parallelism and over 50% faster than state-of-the-art industrial tools for pipeline parallelism.
    Prediction of Neonatal Respiratory Distress in Term Babies at Birth from Digital Stethoscope Recorded Chest Sounds. (arXiv:2201.10105v1 [eess.AS])
    (2 min) Neonatal respiratory distress is a common condition that if left untreated, can lead to short- and long-term complications. This paper investigates the usage of digital stethoscope recorded chest sounds taken within 1min post-delivery, to enable early detection and prediction of neonatal respiratory distress. Fifty-one term newborns were included in this study, 9 of whom developed respiratory distress. For each newborn, 1min anterior and posterior recordings were taken. These recordings were pre-processed to remove noisy segments and obtain high-quality heart and lung sounds. The random undersampling boosting (RUSBoost) classifier was then trained on a variety of features, such as power and vital sign features extracted from the heart and lung sounds. The RUSBoost algorithm produced specificity, sensitivity, and accuracy results of 85.0%, 66.7% and 81.8%, respectively.
    Stochastic Coded Federated Learning with Convergence and Privacy Guarantees. (arXiv:2201.10092v1 [cs.LG])
    (2 min) Federated learning (FL) has attracted much attention as a privacy-preserving distributed machine learning framework, where many clients collaboratively train a machine learning model by exchanging model updates with a parameter server instead of sharing their raw data. Nevertheless, FL training suffers from slow convergence and unstable performance due to stragglers caused by the heterogeneous computational resources of clients and fluctuating communication rates. This paper proposes a coded FL framework, namely *stochastic coded federated learning* (SCFL) to mitigate the straggler issue. In the proposed framework, each client generates a privacy-preserving coded dataset by adding additive noise to the random linear combination of its local data. The server collects the coded datasets from all the clients to construct a composite dataset, which helps to compensate for the straggling effect. In the training process, the server as well as clients perform mini-batch stochastic gradient descent (SGD), and the server adds a make-up term in model aggregation to obtain unbiased gradient estimates. We characterize the privacy guarantee by the mutual information differential privacy (MI-DP) and analyze the convergence performance in federated learning. Besides, we demonstrate a privacy-performance tradeoff of the proposed SCFL method by analyzing the influence of the privacy constraint on the convergence rate. Finally, numerical experiments corroborate our analysis and show the benefits of SCFL in achieving fast convergence while preserving data privacy.
    A Hybrid Science-Guided Machine Learning Approach for Modeling and Optimizing Chemical Processes. (arXiv:2112.01475v2 [cs.LG] UPDATED)
    (2 min) This study presents a broad perspective of hybrid process modeling and optimization combining the scientific knowledge and data analytics in bioprocessing and chemical engineering with a science-guided machine learning (SGML) approach. We divide the approach into two major categories. The first refers to the case where a data-based ML model compliments and makes the first-principle science-based model more accurate in prediction, and the second corresponds to the case where scientific knowledge helps make the ML model more scientifically consistent. We present a detailed review of scientific and engineering literature relating to the hybrid SGML approach, and propose a systematic classification of hybrid SGML models. For applying ML to improve science-based models, we present expositions of the sub-categories of direct serial and parallel hybrid modeling and their combinations, inverse modeling, reduced-order modeling, quantifying uncertainty in the process and even discovering governing equations of the process model. For applying scientific principles to improve ML models, we discuss the sub-categories of science-guided design, learning and refinement. For each sub-category, we identify its requirements, advantages and limitations, together with their published and potential areas of applications in bioprocessing and chemical engineering.We also present several examples to illustrate different hybrid SGML methodologies for modeling polymer processes.
    Neural Information Squeezer for Causal Emergence. (arXiv:2201.10154v1 [cs.LG])
    (2 min) The classic studies of causal emergence have revealed that in some Markovian dynamical systems, far stronger causal connections can be found on the higher-level descriptions than the lower-level of the same systems if we coarse-grain the system states in an appropriate way. However, identifying this emergent causality from the data is still a hard problem that has not been solved because the correct coarse-graining strategy can not be found easily. This paper proposes a general machine learning framework called Neural Information Squeezer to automatically extract the effective coarse-graining strategy and the macro-state dynamics, as well as identify causal emergence directly from the time series data. By decomposing a coarse-graining operation into two processes: information conversion and information dropping out, we can not only exactly control the width of the information channel, but also can derive some important properties analytically including the exact expression of the effective information of a macro-dynamics. We also show how our framework can extract the dynamics on different levels and identify causal emergence from the data on several exampled systems.
    Maximizing information from chemical engineering data sets: Applications to machine learning. (arXiv:2201.10035v1 [stat.ML])
    (2 min) It is well-documented how artificial intelligence can have (and already is having) a big impact on chemical engineering. But classical machine learning approaches may be weak for many chemical engineering applications. This review discusses how challenging data characteristics arise in chemical engineering applications. We identify four characteristics of data arising in chemical engineering applications that make applying classical artificial intelligence approaches difficult: (1) high variance, low volume data, (2) low variance, high volume data, (3) noisy/corrupt/missing data, and (4) restricted data with physics-based limitations. For each of these four data characteristics, we discuss applications where these data characteristics arise and show how current chemical engineering research is extending the fields of data science and machine learning to incorporate these challenges. Finally, we identify several challenges for future research.
    Learning Model Checking and the Kernel Trick for Signal Temporal Logic on Stochastic Processes. (arXiv:2201.09928v1 [cs.LO])
    (2 min) We introduce a similarity function on formulae of signal temporal logic (STL). It comes in the form of a kernel function, well known in machine learning as a conceptually and computationally efficient tool. The corresponding kernel trick allows us to circumvent the complicated process of feature extraction, i.e. the (typically manual) effort to identify the decisive properties of formulae so that learning can be applied. We demonstrate this consequence and its advantages on the task of predicting (quantitative) satisfaction of STL formulae on stochastic processes: Using our kernel and the kernel trick, we learn (i) computationally efficiently (ii) a practically precise predictor of satisfaction, (iii) avoiding the difficult task of finding a way to explicitly turn formulae into vectors of numbers in a sensible way. We back the high precision we have achieved in the experiments by a theoretically sound PAC guarantee, ensuring our procedure efficiently delivers a close-to-optimal predictor.
    Deep convolutional neural network for shape optimization using level-set approach. (arXiv:2201.06210v2 [math.OC] UPDATED)
    (2 min) This article presents a reduced-order modeling methodology via deep convolutional neural networks (CNNs) for shape optimization applications. The CNN provides a nonlinear mapping between the shapes and their associated attributes while conserving the equivariance of these attributes to the shape translations. To implicitly represent complex shapes via a CNN-applicable Cartesian structured grid, a level-set method is employed. The CNN-based reduced-order model (ROM) is constructed in a completely data-driven manner thus well suited for non-intrusive applications. We demonstrate our ROM-based shape optimization framework on a gradient-based three-dimensional shape optimization problem to minimize the induced drag of a wing in low-fidelity potential flow. We show a good agreement between ROM-based optimal aerodynamic coefficients and their counterparts obtained via a potential flow solver. The predicted behavior of the optimized shape is consistent with theoretical predictions. We also present the learning mechanism of the deep CNN model in a physically interpretable manner. The CNN-ROM-based shape optimization algorithm exhibits significant computational efficiency compared to the full-order model-based online optimization applications. The proposed algorithm promises to develop a tractable DL-ROM-driven framework for shape optimization and adaptive morphing structures.
    Iterative Activation-based Structured Pruning. (arXiv:2201.09881v1 [cs.LG])
    (2 min) Deploying complex deep learning models on edge devices is challenging because they have substantial compute and memory resource requirements, whereas edge devices' resource budget is limited. To solve this problem, extensive pruning techniques have been proposed for compressing networks. Recent advances based on the Lottery Ticket Hypothesis (LTH) show that iterative model pruning tends to produce smaller and more accurate models. However, LTH research focuses on unstructured pruning, which is hardware-inefficient and difficult to accelerate on hardware platforms. In this paper, we investigate iterative pruning in the context of structured pruning because structurally pruned models map well on commodity hardware. We find that directly applying a structured weight-based pruning technique iteratively, called iterative L1-norm based pruning (ILP), does not produce accurate pruned models. To solve this problem, we propose two activation-based pruning methods, Iterative Activation-based Pruning (IAP) and Adaptive Iterative Activation-based Pruning (AIAP). We observe that, with only 1% accuracy loss, IAP and AIAP achieve 7.75X and 15.88$X compression on LeNet-5, and 1.25X and 1.71X compression on ResNet-50, whereas ILP achieves 4.77X and 1.13X, respectively.
    Challenges in Migrating Imperative Deep Learning Programs to Graph Execution: An Empirical Study. (arXiv:2201.09953v1 [cs.SE])
    (2 min) Efficiency is essential to support responsiveness w.r.t. ever-growing datasets, especially for Deep Learning (DL) systems. DL frameworks have traditionally embraced deferred execution-style DL code that supports symbolic, graph-based Deep Neural Network (DNN) computation. While scalable, such development tends to produce DL code that is error-prone, non-intuitive, and difficult to debug. Consequently, more natural, less error-prone imperative DL frameworks encouraging eager execution have emerged but at the expense of run-time performance. While hybrid approaches aim for the "best of both worlds," the challenges in applying them in the real world are largely unknown. We conduct a data-driven analysis of challenges -- and resultant bugs -- involved in writing reliable yet performant imperative DL code by studying 250 open-source projects, consisting of 19.7 MLOC, along with 470 and 446 manually examined code patches and bug reports, respectively. The results indicate that hybridization: (i) is prone to API misuse, (ii) can result in performance degradation -- the opposite of its intention, and (iii) has limited application due to execution mode incompatibility. We put forth several recommendations, best practices, and anti-patterns for effectively hybridizing imperative DL code, potentially benefiting DL practitioners, API designers, tool developers, and educators.
    AutoMC: Automated Model Compression based on Domain Knowledge and Progressive search strategy. (arXiv:2201.09884v1 [cs.LG])
    (2 min) Model compression methods can reduce model complexity on the premise of maintaining acceptable performance, and thus promote the application of deep neural networks under resource constrained environments. Despite their great success, the selection of suitable compression methods and design of details of the compression scheme are difficult, requiring lots of domain knowledge as support, which is not friendly to non-expert users. To make more users easily access to the model compression scheme that best meet their needs, in this paper, we propose AutoMC, an effective automatic tool for model compression. AutoMC builds the domain knowledge on model compression to deeply understand the characteristics and advantages of each compression method under different settings. In addition, it presents a progressive search strategy to efficiently explore pareto optimal compression scheme according to the learned prior knowledge combined with the historical evaluation information. Extensive experimental results show that AutoMC can provide satisfying compression schemes within short time, demonstrating the effectiveness of AutoMC.
    The Vehicle Trajectory Prediction Based on ResNet and EfficientNet Model. (arXiv:2201.09973v1 [cs.CV])
    (2 min) At present, a major challenge for the application of automatic driving technology is the accurate prediction of vehicle trajectory. With the vigorous development of computer technology and the emergence of convolution depth neural network, the accuracy of prediction results has been improved. But, the depth, width of the network and image resolution are still important reasons that restrict the accuracy of the model and the prediction results. The main innovation of this paper is the combination of RESNET network and efficient net network, which not only greatly increases the network depth, but also comprehensively changes the choice of network width and image resolution, so as to make the model performance better, but also save computing resources as much as possible. The experimental results also show that our proposed model obtains the optimal prediction results. Specifically, the loss value of our method is separately 4 less and 2.1 less than that of resnet and efficientnet method.
    Towards Multi-Objective Statistically Fair Federated Learning. (arXiv:2201.09917v1 [cs.LG])
    (2 min) Federated Learning (FL) has emerged as a result of data ownership and privacy concerns to prevent data from being shared between multiple parties included in a training procedure. Although issues, such as privacy, have gained significant attention in this domain, not much attention has been given to satisfying statistical fairness measures in the FL setting. With this goal in mind, we conduct studies to show that FL is able to satisfy different fairness metrics under different data regimes consisting of different types of clients. More specifically, uncooperative or adversarial clients might contaminate the global FL model by injecting biased or poisoned models due to existing biases in their training datasets. Those biases might be a result of imbalanced training set (Zhang and Zhou 2019), historical biases (Mehrabi et al. 2021a), or poisoned data-points from data poisoning attacks against fairness (Mehrabi et al. 2021b; Solans, Biggio, and Castillo 2020). Thus, we propose a new FL framework that is able to satisfy multiple objectives including various statistical fairness metrics. Through experimentation, we then show the effectiveness of this method comparing it with various baselines, its ability in satisfying different objectives collectively and individually, and its ability in identifying uncooperative or adversarial clients and down-weighing their effect
    BiasedWalk: Learning Global-aware Node Embeddings via Biased Sampling. (arXiv:2201.09882v1 [cs.LG])
    (2 min) Popular node embedding methods such as DeepWalk follow the paradigm of performing random walks on the graph, and then requiring each node to be proximate to those appearing along with it. Though proved to be successful in various tasks, this paradigm reduces a graph with topology to a set of sequential sentences, thus omitting global information. To produce global-aware node embeddings, we propose BiasedWalk, a biased random walk strategy that favors nodes with similar semantics. Empirical evidence suggests BiasedWalk can generally enhance global awareness of the generated embeddings.
    Long-time prediction of nonlinear parametrized dynamical systems by deep learning-based reduced order models. (arXiv:2201.10215v1 [math.NA])
    (2 min) Deep learning-based reduced order models (DL-ROMs) have been recently proposed to overcome common limitations shared by conventional ROMs - built, e.g., exclusively through proper orthogonal decomposition (POD) - when applied to nonlinear time-dependent parametrized PDEs. In particular, POD-DL-ROMs can achieve extreme efficiency in the training stage and faster than real-time performances at testing, thanks to a prior dimensionality reduction through POD and a DL-based prediction framework. Nonetheless, they share with conventional ROMs poor performances regarding time extrapolation tasks. This work aims at taking a further step towards the use of DL algorithms for the efficient numerical approximation of parametrized PDEs by introducing the $\mu t$-POD-LSTM-ROM framework. This novel technique extends the POD-DL-ROM framework by adding a two-fold architecture taking advantage of long short-term memory (LSTM) cells, ultimately allowing long-term prediction of complex systems' evolution, with respect to the training window, for unseen input parameter values. Numerical results show that this recurrent architecture enables the extrapolation for time windows up to 15 times larger than the training time domain, and achieves better testing time performances with respect to the already lightning-fast POD-DL-ROMs.
    A Regularity Theory for Static Schr\"odinger Equations on $\mathbb{R}^d$ in Spectral Barron Spaces. (arXiv:2201.10072v1 [math.AP])
    (2 min) Spectral Barron spaces have received considerable interest recently as it is the natural function space for approximation theory of two-layer neural networks with a dimension-free convergence rate. In this paper we study the regularity of solutions to the whole-space static Schr\"odinger equation in spectral Barron spaces. We prove that if the source of the equation lies in the spectral Barron space $\mathcal{B}^s(\mathbb{R}^d)$ and the potential function admitting a non-negative lower bound decomposes as a positive constant plus a function in $\mathcal{B}^s(\mathbb{R}^d)$, then the solution lies in the spectral Barron space $\mathcal{B}^{s+2}(\mathbb{R}^d)$.
    S2MS: Self-Supervised Learning Driven Multi-Spectral CT Image Enhancement. (arXiv:2201.10294v1 [eess.IV])
    (2 min) Photon counting spectral CT (PCCT) can produce reconstructed attenuation maps in different energy channels, reflecting energy properties of the scanned object. Due to the limited photon numbers and the non-ideal detector response of each energy channel, the reconstructed images usually contain much noise. With the development of Deep Learning (DL) technique, different kinds of DL-based models have been proposed for noise reduction. However, most of the models require clean data set as the training labels, which are not always available in medical imaging field. Inspiring by the similarities of each channel's reconstructed image, we proposed a self-supervised learning based PCCT image enhancement framework via multi-spectral channels (S2MS). In S2MS framework, both the input and output labels are noisy images. Specifically, one single channel image was used as output while images of other single channels and channel-sum image were used as input to train the network, which can fully use the spectral data information without extra cost. The simulation results based on the AAPM Low-dose CT Challenge database showed that the proposed S2MS model can suppress the noise and preserve details more effectively in comparison with the traditional DL models, which has potential to improve the image quality of PCCT in clinical applications.
    The Enforced Transfer: A Novel Domain Adaptation Algorithm. (arXiv:2201.10001v1 [cs.LG])
    (2 min) Existing Domain Adaptation (DA) algorithms train target models and then use the target models to classify all samples in the target dataset. While this approach attempts to address the problem that the source and the target data are from different distributions, it fails to recognize the possibility that, within the target domain, some samples are closer to the distribution of the source domain than the distribution of the target domain. In this paper, we develop a novel DA algorithm, the Enforced Transfer, that deals with this situation. A straightforward but effective idea to deal with this dilemma is to use an out-of-distribution detection algorithm to decide if, during the testing phase, a given sample is closer to the distribution of the source domain, the target domain, or neither. In the first case, this sample is given to a machine learning classifier trained on source samples. In the second case, this sample is given to a machine learning classifier trained on target samples. In the third case, this sample is discarded as neither an ML model trained on source nor an ML model trained on target is suitable to classify it. It is widely known that the first few layers in a neural network extract low-level features, so the aforementioned approach can be extended from classifying samples in three different scenarios to classifying the samples' activations after an empirically determined layer in three different scenarios. The Enforced Transfer implements the idea. On three types of DA tasks, we outperform the state-of-the-art algorithms that we compare against.
    Probability estimation and structured output prediction for learning preferences in last mile delivery. (arXiv:2201.10269v1 [cs.AI])
    (2 min) We study the problem of learning the preferences of drivers and planners in the context of last mile delivery. Given a data set containing historical decisions and delivery locations, the goal is to capture the implicit preferences of the decision-makers. We consider two ways to use the historical data: one is through a probability estimation method that learns transition probabilities between stops (or zones). This is a fast and accurate method, recently studied in a VRP setting. Furthermore, we explore the use of machine learning to infer how to best balance multiple objectives such as distance, probability and penalties. Specifically, we cast the learning problem as a structured output prediction problem, where training is done by repeatedly calling the TSP solver. Another important aspect we consider is that for last-mile delivery, every address is a potential client and hence the data is very sparse. Hence, we propose a two-stage approach that first learns preferences at the zone level in order to compute a zone routing; after which a penalty-based TSP computes the stop routing. Results show that the zone transition probability estimation performs well, and that the structured output prediction learning can improve the results further. We hence showcase a successful combination of both probability estimation and machine learning, all the while using standard TSP solvers, both during learning and to compute the final solution; this means the methodology is applicable to other, real-life, TSP variants, or proprietary solvers.
    Community-based anomaly detection using spectral graph filtering. (arXiv:2201.09936v1 [cs.SI])
    (2 min) Several applications have a community structure where the nodes of the same community share similar attributes. Anomaly or outlier detection in networks is a relevant and widely studied research topic with applications in various domains. Despite a significant amount of anomaly detection frameworks, there is a dearth on the literature of methods that consider both attributed graphs and the community structure of the networks. This paper proposes a community-based anomaly detection algorithm using a spectral graph-based filter that includes the network community structure into the Laplacian matrix adopted as the basis for the Fourier transform. In addition, the choice of the cutoff frequency of the filter considers the number of communities found. In computational experiments, the proposed strategy, called SpecF, showed an outstanding performance in successfully identifying even discrete anomalies. SpecF is better than a baseline disregarding the community structure, especially for networks with a higher community overlapping. Additionally, we present a case study to validate the proposed method to study the dissemination of COVID-19 in the different districts of S\~ao Jos\'e dos Campos, Brazil.
    Novel Blood Pressure Waveform Reconstruction from Photoplethysmography using Cycle Generative Adversarial Networks. (arXiv:2201.09976v1 [cs.LG])
    (2 min) Continuous monitoring of blood pressure (BP)can help individuals manage their chronic diseases such as hypertension, requiring non-invasive measurement methods in free-living conditions. Recent approaches fuse Photoplethysmograph (PPG) and electrocardiographic (ECG) signals using different machine and deep learning approaches to non-invasively estimate BP; however, they fail to reconstruct the complete signal, leading to less accurate models. In this paper, we propose a cycle generative adversarial network (CycleGAN) based approach to extract a BP signal known as ambulatory blood pressure (ABP) from a clean PPG signal. Our approach uses a cycle generative adversarial network that extends theGAN architecture for domain translation, and outperforms state-of-the-art approaches by up to 2x in BP estimation.
    Unboxing the graph: Neural Relational Inference for Mobility Prediction. (arXiv:2201.10307v1 [cs.LG])
    (2 min) Predicting the supply and demand of transport systems is vital for efficient traffic management, control, optimization, and planning. For example, predicting where from/to and when people intend to travel by taxi can support fleet managers to distribute resources; better predicting traffic speeds/congestion allows for pro-active control measures or for users to better choose their paths. Making spatio-temporal predictions is known to be a hard task, but recently Graph Neural Networks (GNNs) have been widely applied on non-euclidean spatial data. However, most GNN models require a predefined graph, and so far, researchers rely on heuristics to generate this graph for the model to use. In this paper, we use Neural Relational Inference to learn the optimal graph for the model. Our approach has several advantages: 1) a Variational Auto Encoder structure allows for the graph to be dynamically determined by the data, potentially changing through time; 2) the encoder structure allows the use of external data in the generation of the graph; 3) it is possible to place Bayesian priors on the generated graphs to encode domain knowledge. We conduct experiments on two datasets, namely the NYC Yellow Taxi and the PEMS road traffic datasets. In both datasets, we outperform benchmarks and show performance comparable to state-of-the-art. Furthermore, we do an in-depth analysis of the learned graphs, providing insights on what kinds of connections GNNs use for spatio-temporal predictions in the transport domain.
    Two heads are better than one: Enhancing medical representations by pre-training over structured and unstructured electronic health records. (arXiv:2201.10113v1 [cs.CL])
    (2 min) The massive context of electronic health records (EHRs) has created enormous potentials for improving healthcare, among which structured (coded) data and unstructured (text) data are two important textual modalities. They do not exist in isolation and can complement each other in most real-life clinical scenarios. Most existing researches in medical informatics, however, either only focus on a particular modality or straightforwardly concatenate the information from different modalities, which ignore the interaction and information sharing between them. To address these issues, we proposed a unified deep learning-based medical pre-trained language model, named UMM-PLM, to automatically learn representative features from multimodal EHRs that consist of both structured data and unstructured data. Specifically, we first developed parallel unimodal information representation modules to capture the unimodal-specific characteristic, where unimodal representations were learned from each data source separately. A cross-modal module was further introduced to model the interactions between different modalities. We pre-trained the model on a large EHRs dataset containing both structured data and unstructured data and verified the effectiveness of the model on three downstream clinical tasks, i.e., medication recommendation, 30-day readmission and ICD coding through extensive experiments. The results demonstrate the power of UMM-PLM compared with benchmark methods and state-of-the-art baselines. Analyses show that UMM-PLM can effectively concern with multimodal textual information and has the potential to provide more comprehensive interpretations for clinical decision making.
  • stat.ML updates on arXiv.org

    On Structured Filtering-Clustering: Global Error Bound and Optimal First-Order Algorithms. (arXiv:1904.07462v3 [stat.ML] UPDATED)
    (2 min) The filtering-clustering models, including trend filtering and convex clustering, have become an important source of ideas and modeling tools in machine learning and related fields. The statistical guarantee of optimal solutions in these models has been extensively studied yet the investigations on the computational aspect have remained limited. In particular, practitioners often employ the first-order algorithms in real-world applications and are impressed by their superior performance regardless of ill-conditioned structures of difference operator matrices, thus leaving open the problem of understanding the convergence property of first-order algorithms. This paper settles this open problem and contributes to the broad interplay between statistics and optimization by identifying a \textit{global error bound} condition, which is satisfied by a large class of dual filtering-clustering problems, and designing a class of \textit{generalized dual gradient ascent} algorithm, which is \textit{optimal} first-order algorithms in deterministic, finite-sum and online settings. Our results are new and help explain why the filtering-clustering models can be efficiently solved by first-order algorithms. We also provide the detailed convergence rate analysis for the proposed algorithms in different settings, shedding light on their potential to solve the filtering-clustering models efficiently. We also conduct experiments on real datasets and the numerical results demonstrate the effectiveness of our algorithms.
    Efficient Approximations of the Fisher Matrix in Neural Networks using Kronecker Product Singular Value Decomposition. (arXiv:2201.10285v1 [cs.NE])
    (2 min) Several studies have shown the ability of natural gradient descent to minimize the objective function more efficiently than ordinary gradient descent based methods. However, the bottleneck of this approach for training deep neural networks lies in the prohibitive cost of solving a large dense linear system corresponding to the Fisher Information Matrix (FIM) at each iteration. This has motivated various approximations of either the exact FIM or the empirical one. The most sophisticated of these is KFAC, which involves a Kronecker-factored block diagonal approximation of the FIM. With only a slight additional cost, a few improvements of KFAC from the standpoint of accuracy are proposed. The common feature of the four novel methods is that they rely on a direct minimization problem, the solution of which can be computed via the Kronecker product singular value decomposition technique. Experimental results on the three standard deep auto-encoder benchmarks showed that they provide more accurate approximations to the FIM. Furthermore, they outperform KFAC and state-of-the-art first-order methods in terms of optimization speed.
    Anomaly detection using principles of human perception. (arXiv:2103.12323v3 [cs.CR] UPDATED)
    (2 min) In the fields of statistics and unsupervised machine learning a fundamental and well-studied problem is anomaly detection. Anomalies are difficult to define, yet many algorithms have been proposed. Underlying the approaches is the nebulous understanding that anomalies are rare, unusual or inconsistent with the majority of data. The present work provides a philosophical treatise to clearly define anomalies and develops an algorithm for their efficient detection with minimal user intervention. Inspired by the Gestalt School of Psychology and the Helmholtz principle of human perception, anomalies are assumed to be observations that are unexpected to occur with respect to certain groupings made by the majority of the data. Under appropriate random variable modelling anomalies are directly found in a set of data by a uniform and independent random assumption of the distribution of constituent elements of the observations, with anomalies corresponding to those observations where the expectation of the number of occurrences of the elements in a given view is $<1$. Starting from fundamental principles of human perception an unsupervised anomaly detection algorithm is developed that is simple, real-time and parameter-free. Experiments suggest it as a competing choice for univariate data with promising results on the detection of global anomalies in multivariate data.
    Zero-Truncated Poisson Regression for Zero-Inflated Multiway Count Data. (arXiv:2201.10014v1 [stat.ME])
    (2 min) We propose a novel statistical inference paradigm for zero-inflated multiway count data that dispenses with the need to distinguish between true and false zero counts. Our approach ignores all zero entries and applies zero-truncated Poisson regression on the positive counts. Inference is accomplished via tensor completion that imposes low-rank structure on the Poisson parameter space. Our main result shows that an $N$-way rank-$R$ parametric tensor $\boldsymbol{\mathscr{M}}\in(0,\infty)^{I\times \cdots\times I}$ generating Poisson observations can be accurately estimated from approximately $IR^2\log_2^2(I)$ non-zero counts for a nonnegative canonical polyadic decomposition. Several numerical experiments are presented demonstrating that our zero-truncated paradigm is comparable to the ideal scenario where the locations of false zero counts are known a priori.
    A deep mixture density network for outlier-corrected interpolation of crowd-sourced weather data. (arXiv:2201.10544v1 [stat.ML])
    (2 min) As the costs of sensors and associated IT infrastructure decreases - as exemplified by the Internet of Things - increasing volumes of observational data are becoming available for use by environmental scientists. However, as the number of available observation sites increases, so too does the opportunity for data quality issues to emerge, particularly given that many of these sensors do not have the benefit of official maintenance teams. To realise the value of crowd sourced 'Internet of Things' type observations for environmental modelling, we require approaches that can automate the detection of outliers during the data modelling process so that they do not contaminate the true distribution of the phenomena of interest. To this end, here we present a Bayesian deep learning approach for spatio-temporal modelling of environmental variables with automatic outlier detection. Our approach implements a Gaussian-uniform mixture density network whose dual purposes - modelling the phenomenon of interest, and learning to classify and ignore outliers - are achieved simultaneously, each by specifically designed branches of our neural network. For our example application, we use the Met Office's Weather Observation Website data, an archive of observations from around 1900 privately run and unofficial weather stations across the British Isles. Using data on surface air temperature, we demonstrate how our deep mixture model approach enables the modelling of a highly skilled spatio-temporal temperature distribution without contamination from spurious observations. We hope that adoption of our approach will help unlock the potential of incorporating a wider range of observation sources, including from crowd sourcing, into future environmental models.
    Robust Geodesic Regression. (arXiv:2007.04518v3 [stat.ML] UPDATED)
    (2 min) This paper studies robust regression for data on Riemannian manifolds. Geodesic regression is the generalization of linear regression to a setting with a manifold-valued dependent variable and one or more real-valued independent variables. The existing work on geodesic regression uses the sum-of-squared errors to find the solution, but as in the classical Euclidean case, the least-squares method is highly sensitive to outliers. In this paper, we use M-type estimators, including the $L_1$, Huber and Tukey biweight estimators, to perform robust geodesic regression, and describe how to calculate the tuning parameters for the latter two. We also show that, on compact symmetric spaces, all M-type estimators are maximum likelihood estimators, and argue for the overall superiority of the $L_1$ estimator over the $L_2$ and Huber estimators on high-dimensional manifolds and over the Tukey biweight estimator on compact high-dimensional manifolds. Results from numerical examples, including analysis of real neuroimaging data, demonstrate the promising empirical properties of the proposed approach.
    Maximizing information from chemical engineering data sets: Applications to machine learning. (arXiv:2201.10035v1 [stat.ML])
    (2 min) It is well-documented how artificial intelligence can have (and already is having) a big impact on chemical engineering. But classical machine learning approaches may be weak for many chemical engineering applications. This review discusses how challenging data characteristics arise in chemical engineering applications. We identify four characteristics of data arising in chemical engineering applications that make applying classical artificial intelligence approaches difficult: (1) high variance, low volume data, (2) low variance, high volume data, (3) noisy/corrupt/missing data, and (4) restricted data with physics-based limitations. For each of these four data characteristics, we discuss applications where these data characteristics arise and show how current chemical engineering research is extending the fields of data science and machine learning to incorporate these challenges. Finally, we identify several challenges for future research.
    Convex Analysis of the Mean Field Langevin Dynamics. (arXiv:2201.10469v1 [stat.ML])
    (2 min) As an example of the nonlinear Fokker-Planck equation, the mean field Langevin dynamics attracts attention due to its connection to (noisy) gradient descent on infinitely wide neural networks in the mean field regime, and hence the convergence property of the dynamics is of great theoretical interest. In this work, we give a simple and self-contained convergence rate analysis of the mean field Langevin dynamics with respect to the (regularized) objective function in both continuous and discrete time settings. The key ingredient of our proof is a proximal Gibbs distribution $p_q$ associated with the dynamics, which, in combination of techniques in [Vempala and Wibisono (2019)], allows us to develop a convergence theory parallel to classical results in convex optimization. Furthermore, we reveal that $p_q$ connects to the duality gap in the empirical risk minimization setting, which enables efficient empirical evaluation of the algorithm convergence.
    Contrastive Representation Distillation. (arXiv:1910.10699v3 [cs.LG] UPDATED)
    (2 min) Often we wish to transfer representational knowledge from one neural network to another. Examples include distilling a large network into a smaller one, transferring knowledge from one sensory modality to a second, or ensembling a collection of models into a single estimator. Knowledge distillation, the standard approach to these problems, minimizes the KL divergence between the probabilistic outputs of a teacher and student network. We demonstrate that this objective ignores important structural knowledge of the teacher network. This motivates an alternative objective by which we train a student to capture significantly more information in the teacher's representation of the data. We formulate this objective as contrastive learning. Experiments demonstrate that our resulting new objective outperforms knowledge distillation and other cutting-edge distillers on a variety of knowledge transfer tasks, including single model compression, ensemble distillation, and cross-modal transfer. Our method sets a new state-of-the-art in many transfer tasks, and sometimes even outperforms the teacher network when combined with knowledge distillation. Code: this http URL
    Optimal estimation of Gaussian DAG models. (arXiv:2201.10548v1 [math.ST])
    (2 min) We study the optimal sample complexity of learning a Gaussian directed acyclic graph (DAG) from observational data. Our main result establishes the minimax optimal sample complexity for learning the structure of a linear Gaussian DAG model with equal variances to be $n\asymp q\log(d/q)$, where $q$ is the maximum number of parents and $d$ is the number of nodes. We further make comparisons with the classical problem of learning (undirected) Gaussian graphical models, showing that under the equal variance assumption, these two problems share the same optimal sample complexity. In other words, at least for Gaussian models with equal error variances, learning a directed graphical model is not more difficult than learning an undirected graphical model. Our results also extend to more general identification assumptions as well as subgaussian errors.
    Transformers Can Do Bayesian Inference. (arXiv:2112.10510v2 [cs.LG] UPDATED)
    (2 min) Currently, it is hard to reap the benefits of deep learning for Bayesian methods, which allow the explicit specification of prior knowledge and accurately capture model uncertainty. We present Prior-Data Fitted Networks (PFNs). PFNs leverage large-scale machine learning techniques to approximate a large set of posteriors. The only requirement for PFNs to work is the ability to sample from a prior distribution over supervised learning tasks (or functions). Our method restates the objective of posterior approximation as a supervised classification problem with a set-valued input: it repeatedly draws a task (or function) from the prior, draws a set of data points and their labels from it, masks one of the labels and learns to make probabilistic predictions for it based on the set-valued input of the rest of the data points. Presented with a set of samples from a new supervised learning task as input, PFNs make probabilistic predictions for arbitrary other data points in a single forward propagation, having learned to approximate Bayesian inference. We demonstrate that PFNs can near-perfectly mimic Gaussian processes and also enable efficient Bayesian inference for intractable problems, with over 200-fold speedups in multiple setups compared to current methods. We obtain strong results in very diverse areas such as Gaussian process regression, Bayesian neural networks, classification for small tabular data sets, and few-shot image classification, demonstrating the generality of PFNs. Code and trained PFNs are released at https://github.com/automl/TransformersCanDoBayesianInference.
    Fast Distributionally Robust Learning with Variance Reduced Min-Max Optimization. (arXiv:2104.13326v2 [cs.LG] UPDATED)
    (2 min) Distributionally robust supervised learning (DRSL) is emerging as a key paradigm for building reliable machine learning systems for real-world applications -- reflecting the need for classifiers and predictive models that are robust to the distribution shifts that arise from phenomena such as selection bias or nonstationarity. Existing algorithms for solving Wasserstein DRSL -- one of the most popular DRSL frameworks based around robustness to perturbations in the Wasserstein distance -- have serious limitations that limit their use in large-scale problems -- in particular they involve solving complex subproblems and they fail to make use of stochastic gradients. We revisit Wasserstein DRSL through the lens of min-max optimization and derive scalable and efficiently implementable stochastic extra-gradient algorithms which provably achieve faster convergence rates than existing approaches. We demonstrate their effectiveness on synthetic and real data when compared to existing DRSL approaches. Key to our results is the use of variance reduction and random reshuffling to accelerate stochastic min-max optimization, the analysis of which may be of independent interest.
    Semi-Supervised Quantile Estimation: Robust and Efficient Inference in High Dimensional Settings. (arXiv:2201.10208v1 [stat.ME])
    (2 min) We consider quantile estimation in a semi-supervised setting, characterized by two available data sets: (i) a small or moderate sized labeled data set containing observations for a response and a set of possibly high dimensional covariates, and (ii) a much larger unlabeled data set where only the covariates are observed. We propose a family of semi-supervised estimators for the response quantile(s) based on the two data sets, to improve the estimation accuracy compared to the supervised estimator, i.e., the sample quantile from the labeled data. These estimators use a flexible imputation strategy applied to the estimating equation along with a debiasing step that allows for full robustness against misspecification of the imputation model. Further, a one-step update strategy is adopted to enable easy implementation of our method and handle the complexity from the non-linear nature of the quantile estimating equation. Under mild assumptions, our estimators are fully robust to the choice of the nuisance imputation model, in the sense of always maintaining root-n consistency and asymptotic normality, while having improved efficiency relative to the supervised estimator. They also attain semi-parametric optimality if the relation between the response and the covariates is correctly specified via the imputation model. As an illustration of estimating the nuisance imputation function, we consider kernel smoothing type estimators on lower dimensional and possibly estimated transformations of the high dimensional covariates, and we establish novel results on their uniform convergence rates in high dimensions, involving responses indexed by a function class and usage of dimension reduction techniques. These results may be of independent interest. Numerical results on both simulated and real data confirm our semi-supervised approach's improved performance, in terms of both estimation and inference.
    GENESIS-V2: Inferring Unordered Object Representations without Iterative Refinement. (arXiv:2104.09958v3 [cs.CV] UPDATED)
    (2 min) Advances in unsupervised learning of object-representations have culminated in the development of a broad range of methods for unsupervised object segmentation and interpretable object-centric scene generation. These methods, however, are limited to simulated and real-world datasets with limited visual complexity. Moreover, object representations are often inferred using RNNs which do not scale well to large images or iterative refinement which avoids imposing an unnatural ordering on objects in an image but requires the a priori initialisation of a fixed number of object representations. In contrast to established paradigms, this work proposes an embedding-based approach in which embeddings of pixels are clustered in a differentiable fashion using a stochastic stick-breaking process. Similar to iterative refinement, this clustering procedure also leads to randomly ordered object representations, but without the need of initialising a fixed number of clusters a priori. This is used to develop a new model, GENESIS-v2, which can infer a variable number of object representations without using RNNs or iterative refinement. We show that GENESIS-v2 performs strongly in comparison to recent baselines in terms of unsupervised image segmentation and object-centric scene generation on established synthetic datasets as well as more complex real-world datasets.
    Almost Optimal Variance-Constrained Best Arm Identification. (arXiv:2201.10142v1 [cs.LG])
    (2 min) We design and analyze VA-LUCB, a parameter-free algorithm, for identifying the best arm under the fixed-confidence setup and under a stringent constraint that the variance of the chosen arm is strictly smaller than a given threshold. An upper bound on VA-LUCB's sample complexity is shown to be characterized by a fundamental variance-aware hardness quantity $H_{VA}$. By proving a lower bound, we show that sample complexity of VA-LUCB is optimal up to a factor logarithmic in $H_{VA}$. Extensive experiments corroborate the dependence of the sample complexity on the various terms in $H_{VA}$. By comparing VA-LUCB's empirical performance to a close competitor RiskAverse-UCB-BAI by David et al. (2018), our experiments suggest that VA-LUCB has the lowest sample complexity for this class of risk-constrained best arm identification problems, especially for the riskiest instances.
    Learning Stable Classifiers by Transferring Unstable Features. (arXiv:2106.07847v3 [cs.LG] UPDATED)
    (2 min) While unbiased machine learning models are essential for many applications, bias is a human-defined concept that can vary across tasks. Given only input-label pairs, algorithms may lack sufficient information to distinguish stable (causal) features from unstable (spurious) features. However, related tasks often share similar biases -- an observation we may leverage to develop stable classifiers in the transfer setting. In this work, we explicitly inform the target classifier about unstable features in the source tasks. Specifically, we derive a representation that encodes the unstable features by contrasting different data environments in the source task. We achieve robustness by clustering data of the target task according to this representation and minimizing the worst-case risk across these clusters. We evaluate our method on both text and image classifications. Empirical results demonstrate that our algorithm is able to maintain robustness on the target task for both synthetically generated environments and real-world environments.
    Design choice and machine learning model performances. (arXiv:2201.10239v1 [stat.ML])
    (2 min) An increasing number of publications present the joint application of Design of Experiments (DOE) and machine learning (ML) as a methodology to collect and analyze data on a specific industrial phenomenon. However, the literature shows that the choice of the design for data collection and model for data analysis is often driven by incidental factors, rather than by statistical or algorithmic advantages, thus there is a lack of studies which provide guidelines on what designs and ML models to jointly use for data collection and analysis. This is the first time in the literature that a paper discusses the choice of design in relation to the ML model performances. An extensive study is conducted that considers 12 experimental designs, 7 families of predictive models, 7 test functions that emulate physical processes, and 8 noise settings, both homoscedastic and heteroscedastic. The results of the research can have an immediate impact on the work of practitioners, providing guidelines for practical applications of DOE and ML.
    Sparse bottleneck neural networks for exploratory non-linear visualization of Patch-seq data. (arXiv:2006.10411v3 [cs.LG] UPDATED)
    (2 min) Patch-seq, a recently developed experimental technique, allows neuroscientists to obtain transcriptomic and electrophysiological information from the same neurons. Efficiently analyzing and visualizing such paired multivariate data in order to extract biologically meaningful interpretations has, however, remained a challenge. Here, we use sparse deep neural networks with and without a two-dimensional bottleneck to predict electrophysiological features from the transcriptomic ones using a group lasso penalty, yielding concise and biologically interpretable two-dimensional visualizations. In two large example data sets, this visualization reveals known neural classes and their marker genes without biological prior knowledge. We also demonstrate that our method is applicable to other kinds of multimodal data, such as paired transcriptomic and proteomic measurements provided by CITE-seq.
    A Class of Conjugate Priors for Multinomial Probit Models which Includes the Multivariate Normal One. (arXiv:2007.06944v2 [stat.ME] UPDATED)
    (2 min) Multinomial probit models are routinely-implemented representations for learning how the class probabilities of categorical response data change with p observed predictors. Although several frequentist methods have been developed for estimation, inference and classification within such a class of models, Bayesian inference is still lagging behind. This is due to the apparent absence of a tractable class of conjugate priors, that may facilitate posterior inference on the multinomial probit coefficients. Such an issue has motivated increasing efforts toward the development of effective Markov chain Monte Carlo methods, but state-of-the-art solutions still face severe computational bottlenecks, especially in high dimensions. In this article, we show that the entire class of unified skew-normal (SUN) distributions is conjugate to several multinomial probit models. Leveraging this result and the SUN properties, we improve upon state-of-the-art solutions for posterior inference and classification both in terms of closed-form results for several functionals of interest, and also by developing novel computational methods relying either on independent and identically distributed samples from the exact posterior or on scalable and accurate variational approximations based on blocked partially-factorized representations. As illustrated in simulations and in a gastrointestinal lesions application, the magnitude of the improvements relative to current methods is particularly evident, in practice, when the focus is on high-dimensional studies.
    Bayesian Inference with Nonlinear Generative Models: Comments on Secure Learning. (arXiv:2201.09986v1 [cs.IT])
    (2 min) Unlike the classical linear model, nonlinear generative models have been addressed sparsely in the literature. This work aims to bring attention to these models and their secrecy potential. To this end, we invoke the replica method to derive the asymptotic normalized cross entropy in an inverse probability problem whose generative model is described by a Gaussian random field with a generic covariance function. Our derivations further demonstrate the asymptotic statistical decoupling of Bayesian inference algorithms and specify the decoupled setting for a given nonlinear model. The replica solution depicts that strictly nonlinear models establish an all-or-nothing phase transition: There exists a critical load at which the optimal Bayesian inference changes from being perfect to an uncorrelated learning. This finding leads to design of a new secure coding scheme which achieves the secrecy capacity of the wiretap channel. The proposed coding has a significantly smaller codebook size compared to the random coding scheme of Wyner. This interesting result implies that strictly nonlinear generative models are perfectly secured without any secure coding. We justify this latter statement through the analysis of an illustrative model for perfectly secure and reliable inference.
  • Google AI Blog

    Does Your Medical Image Classifier Know What It Doesn’t Know?
    (9 min) Posted by Abhijit Guha Roy, Research Software Engineer and Jie Ren, Research Scientist, Google Research Deep machine learning (ML) systems have achieved considerable success in medical image analysis in recent years. One major contributing factor is access to abundant labeled datasets, which are used to train highly effective supervised deep learning models. However, in the real-world, these models may encounter samples exhibiting rare conditions that are individually too infrequent for per-condition classification. Nevertheless, such conditions can be collectively common because they follow a long-tail distribution and when taken together can represent a significant portion of cases — e.g., in a recent deep learning dermatological study, hundreds of rare conditions composed around 20%…
  • Data Science Central

    Common Reasons Why Organizations Fail to Implement SAP SuccessFactors
    (5 min) An organization has to go through a lot before finding the ideal human resource management solution suitable for their business. But let’s say after watching tons of demos, scoping, and thorough evaluation, an organization finally decides to go with SAP SuccessFactors for apparent reasons. However, selecting the ideal solution does not mean that things will… Read More »Common Reasons Why Organizations Fail to Implement SAP SuccessFactors The post Common Reasons Why Organizations Fail to Implement SAP SuccessFactors appeared first on Data Science Central.
    Benefits of Using Kafka to Handle Real-time Data Streams
    (4 min) Enterprises collect and use more data than ever before. To make the most out of all this data, they build complex data pipelines, most of which are capable of handling over a million transactions per minute. To meet this demand, enterprises need to reliably handle high throughput data feeds with low latencies. This is because… Read More »Benefits of Using Kafka to Handle Real-time Data Streams The post Benefits of Using Kafka to Handle Real-time Data Streams appeared first on Data Science Central.
    How Data Matching Helps Overcome The Impacts of Bad Data in AI
    (4 min) Every passing day, businesses are adapting to evolving processes introduced by full or partial automation of systems. A study suggests that ML solutions have begun demonstrating better results as compared to human statisticians when it comes to data preparation. However, before the world takes that eventual final leap into the limitless possibilities of artificial intelligence,… Read More »How Data Matching Helps Overcome The Impacts of Bad Data in AI The post How Data Matching Helps Overcome The Impacts of Bad Data in AI appeared first on Data Science Central.
  • The Official NVIDIA Blog

    Nearly 80 Percent of Financial Firms Use AI to Improve Services, Reduce Fraud
    (3 min) From the largest firms trading on Wall Street to banks providing customers with fraud protection to fintechs recommending best-fit products to consumers, AI is driving innovation across the financial services industry. New research from NVIDIA found that 78 percent of financial services professionals state that their company uses accelerated computing to deliver AI-enabled applications through Read article > The post Nearly 80 Percent of Financial Firms Use AI to Improve Services, Reduce Fraud appeared first on The Official NVIDIA Blog.
    Let Me Upgrade You: GeForce NOW Adds Resolution Upscaling and More This GFN Thursday
    (4 min) GeForce NOW is taking cloud gaming to new heights. This GFN Thursday delivers an upgraded streaming experience as part of an update that is now available to all members. It includes new resolution upscaling options to make members’ gaming experiences sharper, plus the ability to customize streaming settings in session. The GeForce NOW app is Read article > The post Let Me Upgrade You: GeForce NOW Adds Resolution Upscaling and More This GFN Thursday appeared first on The Official NVIDIA Blog.
  • MIT News - Artificial intelligence

    Where did that sound come from?
    (7 min) MIT neuroscientists have developed a computer model that can answer that question as well as the human brain.
    Demystifying machine-learning systems
    (7 min) A new method automatically describes, in natural language, what the individual components of a neural network do.
  • AWS Machine Learning Blog

    How Clearly accurately predicts fraudulent orders using Amazon Fraud Detector
    (5 min) This post was cowritten by Ziv Pollak, Machine Learning Team Lead, and Sarvi Loloei, Machine Learning Engineer at Clearly. The content and opinions in this post are those of the third-party authors and AWS is not responsible for the content or accuracy of this post. A pioneer in online shopping, Clearly launched their first site […]
  • Becoming Human: Artificial Intelligence Magazine - Medium

    Supervised vs Unsupervised Machine Learning
    (5 min) Understanding the Difference Between Supervised vs Unsupervised Machine Learning
  • OpenAI

    Aligning Language Models to Follow Instructions
    (11 min) We’ve trained language models that are much better at following user intentions than GPT-3 while also making them more truthful and less toxic, using techniques developed through our alignment research. These InstructGPT models, which are trained with humans in the loop, are now deployed as the default language
  • John D. Cook

    Information in a discretized normal
    (2 min) Here’s a problem that came up in my work today. Suppose you have a normal random variable X and you observe X in n discrete bins. The first bin is the left α tail, the last bin is the right α tail, and the range between the two tails is divided evenly into n-2 intervals. […] Information in a discretized normal first appeared on John D. Cook.
  • Artificial Intelligence

    Does anyone else find it annoying how Musk and far right pandering lunatics like Lex Fridman are often seen as the public face of AI? I don't trust these people.
    (5 min) ​ I was just reminded of how annoying it is to me after seeing Musk and Lex tweet in support of the racist, antisemitic antivax Canadian"Truckers for Freedom". Lex Fridman in particular you see everywhere online and in AI circles and is viewed as some AI publicist, especially among those who want to get into the field. Lex is good friends with Michael Malice a well known troll, conspiracy theorist and frequent Fox News guest. Lex is also blatantly obsessed with and a shill for Joe Rogan. Malice is close friends with Alex Jones and Mike Cernovich (who also has hosted and pals around with Lauren Southern (Nazi), insane republicans like Madison Cawthorn, etc). I think it's bad for AI in general. Why these two assholes get to be the face of AI and act like they represent it and interview all the top people. Yann LeCunn, Andrew Ng, Musk, Zuckerberg and many, many more. It sucks. I just want AI to be a welcoming community without being represented by any one person, especially someone with reprehensible, far right views. No wonder AI has a bad reputation. This isn't going to get any better at this rate. And the sad thing is out of all the people, professors and students I have worked with in Math, AI and Programming I have rarely ran into bigoted or conspiracy type people in person. Do you really trust people like Lex to be responsible and try to avoid bias in AI? They don't even believe it exists. If anything, they would be implementing these algorithms (i.e palantir). Update: And if you want proof just look at how much hate I am getting, it is almost like it's too late. Just look at their post or comment history. Most of them are just Tesla fanboys who think Musk is a god and will defend his honor. submitted by /u/TrickyRackets [link] [comments]
    [NLP] Article Summarizing
    (0 min) Which language model best summarizes the articles? I'm specifically looking for a language model that is multilingual. Thanks in advance. submitted by /u/Satrayn [link] [comments]
    CMU’s Latest Machine Learning Research Analyzes and Improves Spectral Normalization In GANs
    (1 min) GANs (generative adversarial networks) are cutting-edge deep generative models that are best known for producing high-resolution, photorealistic photographs. The goal of GANs is to generate random samples from a target data distribution with only a small set of training examples available. This is accomplished by learning two functions: a generator G that maps random input noise to a generated sample, and a discriminator D that attempts to categorize input samples as accurate (i.e., from the training dataset) or fake (i.e., not from the training dataset) (i.e., produced by the generator). Despite its success in enhancing the sample quality of data-driven generative models, GANs’ adversarial training adds to instability. Small changes in hyperparameters, as well as randomness in the optimization process, might cause training to fail. Different architectures, loss functions, and various forms of regularizations/normalizations have all been presented as ways to improve the stability of GANs. Continue Reading Paper: https://arxiv.org/pdf/2009.02773.pdf Github: https://github.com/fjxmlzn/BSN CMU Blog: https://blog.ml.cmu.edu/2022/01/21/why-spectral-normalization-stabilizes-gans-analysis-and-improvements/ submitted by /u/techsucker [link] [comments]
    The understanding debate
    (0 min) submitted by /u/bendee983 [link] [comments]
    Who playing voyage?
    (0 min) submitted by /u/roblox22y [link] [comments]
    Artificial Nightmares : Lilith Daughter of Hatred || Pytti VQGAN AI Art Video [4K 60 FPS]
    (0 min) submitted by /u/Thenamessd [link] [comments]
    Kedro, A Python Machine Learning Pipeline Framework Developed By McKinsey, Has Been Donated To The Linux Foundation
    (0 min) submitted by /u/techsucker [link] [comments]
    Multi Agent Task Planning
    (1 min) Hello, I am looking for any good suggestions on algorithms/techniques/papers for optimally allocating agents to certain tasks. Without getting in to too much detail, I basically have a number of agents with potentially differing roles and abilities, and want to find a way to have all agents work together when needed to accomplish certain tasks (like delivering a package or assembling a part). Theres obviously ways to do this for each agent independently, but I would like for the "swarm" to take advantage of any shared knowledge, and produce a more intelligent output. I cant seem to find a general consensus as to how to go about this. For example, for path planning, there are many well defined algorithms like A-Star, RRT, etc., but for distributed decision making, it seems to not be the case. Thanks! submitted by /u/Throwawaynumber2010 [link] [comments]
    A potential strange mystery. I found a dodgy looking site that seems to be written by a very good quality language model. Can someone help me figure out if it is a language model, or some other elaborate scheme.
    (2 min) Was doing some research on a random topic. Found a website that had a QA structure to it, the answers seemed legit. The site is deep in the uncanny valley, if you can call it that. Having used GPT3, it seemed to be a very well tuned GPT3 model giving, seemingly, factual and accurate responses. ​ What surprised me is the sheer volume of daily QA and how indepth they are. The daily volume clearly can not be done by human labour, a non-AI algo would have a hard time pulling the structure of the citation links for each article, they get deep and thorough! But here is the kicker. The site started in November and every day it's been adding at a very steady clip. ​ Imagine if the Google search algo became sentient and said, "F*ck it, THIS is how I will organize your damn internet! and no pictures! Now leave me alone". That's the feeling I get from it. Thing is, IF this is indeed an AI writing the site, then the implication is, future AIs may learn off another AI. This is the main reason I want to find out if this site is indeed written by an AI. I am not going to share the site link or how to find it via search, for 3 reasons: 1 - not to make the moderators angry with me as a newbie to sub. 2 - not to give the site a back link, that has the potential of delivering volume (I'm suspicious of the human element behind this site, they are anonymous, no about us page or anything) 3 - not give ideas for how to search, thus potentially giving another signal through search terms. (to the real AI running our world) If you want to help me figure out the story, DM me and I'll share the link. It may turn out to a whole lot of nothing! My money is on Wu Dau 2.0 ​ Also, I realize my choice, kinda screams scam, so can I ask a moderator to DM me and check me out and my story. Just trying to figure out the story behind this, and it's possible implications. submitted by /u/rsibs10 [link] [comments]
  • Machine Learning

    [D] Tools alike to arxiv-sanity for finding new papers/creating one's library.
    (1 min) I was trying to find a tool to track papers and create my own library. I found http://www.arxiv-sanity.com/ , which seemed perfect but it turns out it's down, there is also https://arxiv-sanity-lite.com/ which unfortunately is not the same as arxiv-sanity. Before I had found tools such as artigo which allow people to annotate papers and share with others, or Connected papers which produces a graph of related papers, but unfortunately does not have the option to create one's library/annotate. So, would you mind sharing any tool you find useful for paper discovery, tracking, annotation etc? I am asking because I have been looking at many papers lately. I wasn't very organized at first and now I forgot which papers I skimmed, the main points of each papers, and also realized there are many papers I had no idea of for a long time that are relevant to my interest. submitted by /u/carlml [link] [comments]
    [Research] Is there work on detecting freebooted videos?
    (1 min) I have been trying to find work related to detecting stolen videos, specifically things like reaction videos where the person is simply filming themselves watching a video or when people modulate the audio or video to get past the YouTube detection system. I was thinking this would be related to something like similarity metrics for videos/transcripts but I can't seem to find any papers related to this topic specifically. Anyone have any pointers where to look? I have been doing computer vision research for a few years but I am a bit out of date on the recent work in video tasks. submitted by /u/vexillology-nerd [link] [comments]
    [D] Best way and hassle free approach to loading T5 model on multiple GPUs?
    (1 min) Does anybody know how to load this large model (https://github.com/google-research/text-to-text-transfer-transformer) on multiple GPUs. So load some layers on some GPUs and other layers on other GPUs on the same machine. This model needs about 100GB of memory and though I'm trying to use model.parallelize(), I still run into Cuda out of memory error. Any assistance would be greatly appreciated! submitted by /u/rirhun [link] [comments]
    [R] - Multilingual Text Data for Training
    (1 min) Hi all. I’m looking for around 5000 files that I can download for some training. I am hoping these models will be document files made up of content in a multitude of languages. Any ideas of a source I can use? Thanks submitted by /u/w3aryb0arpig [link] [comments]
    [N] OpenAI raises a $250 million Series A
    (1 min) Source: https://twitter.com/sama/status/1486497281884897281?s=20 After our pre-friends-and-family round in 2016, our F&F round in 2017, our angel round in 2018, our pre-seed round in 2019, our seed round in 2020, and our seed extension in 2021, we're delighted to share we’ve raised a Series A of $250 million. Humbled by such a strong start. submitted by /u/rantana [link] [comments]
    [R] With all the data2vec hype, has anybody related it to Zero-Shot?
    (0 min) I did not find such a thought, but wouldn't it be great to use it for solving zero-shot tasks? One can just get rid of the image input with data2vec, right? submitted by /u/MrLeylo [link] [comments]
    [R] Gelman, Hill, and Vehtari's "Regression and Other Stories" is now free as a pdf from the authors!
    (1 min) Link: https://users.aalto.fi/~ave/ROS.pdf tl;dr: The GOAT applied regression book (Gelman and Hill), has been updated after 15 years, and it's free! Blurb: "Many textbooks on regression focus on theory and the simplest of examples. Real statistical problems, however, are complex and subtle. This is not a book about the theory of regression. It is a book about how to use regression to solve real problems of comparison, estimation, prediction, and causal inference. It focuses on practical issues such as sample size and missing data and a wide range of goals and techniques. It jumps right in to methods and computer code you can use fresh out of the box." submitted by /u/bikeskata [link] [comments]
    [D] Looking for a paper
    (0 min) I'm trying to find a paper regarding spatial embeddings which reformulated it as finding embeddings with a good cross-correlation (e.g. gaussian). From what I remember they simply used some gaussian noise with a smoothing kernels over a grid to get their embeddings. submitted by /u/OPKatten [link] [comments]
    [D] How does SHAP work for seq2seq models?
    (1 min) Suppose we have a seq2seq function s(x_1,x_2,x_3) -> y_1 y_2 y_3, how exactly is SHAP defined in this case? Specifically, what exactly is the function f defined in Theorem 1 in the paper? https://arxiv.org/pdf/1705.07874.pdf I am guessing we first fix a y_i, i={1,2,3} and we define f(x) = 1 if y_i is in the output of s(x), 0 otherwise, where x = x_1, x_2, x_3. Is that correct? Thanks! submitted by /u/unital [link] [comments]
    [D]Spatial value assessment based on known reference points
    (1 min) Hi, I am trying to create a model to (probabilistically) determine the value in space according to the values at known reference points. Specifically, based on soil samples at known points, I would like to make some estimate of the value for every point. Below is an example picture: Soil Nitrogen Sampling Intuitively, I would say that at the arbitrary point, the value is the average sum of the known reference point values inversely proportional to the distances. Does anyone know if there exists any known algorithm suitable for value assessment based on known reference points, or is the aforementioned reasoning a satisfying approximation? submitted by /u/ThickDoctor007 [link] [comments]
    [N] Zod.TV is now offering free GPU compute power for Machine Learning !
    (1 min) Zod.TV is an upcoming containerised GPUs platform for Video Transcoding, Live Streaming and Machine Learning. ​ We focus on user experience, simplicity and competitive pricing. Please visit our website for more information and instruction how to use our GPUs https://zod.tv ​ https://preview.redd.it/sg4goc87c7e81.jpg?width=1794&format=pjpg&auto=webp&s=5b427a9b9293b4cac8f12784e15f743dc11d1e8b ​ Our community: Discord: https://discord.com/invite/jjBSjSF Telegram: https://t.me/zodtv8 Instagram: https://instagram.com/zod.tv8 submitted by /u/vietanh2901 [link] [comments]
    [D] Which simulation platform is most common for experimenting with AI
    (2 min) I'm pretty sure many people use Unity for that purpose, but I was interested if there's any platform that's oriented on this subject, I'm trying to find a best platform which wouldn't make me spend much time on creating environment, but rather allows me to quickly deal with environment so that I can focus on ml related staff. Maybe there's anything industry standard? This environment is pretty simple, but just for an example: https://www.youtube.com/watch?v=Lu56xVlZ40M if I were to build I wouldn't want to spend too much time on building environment submitted by /u/mefistofeli [link] [comments]
    [R] ML & NLP Reasearch Highlights of 2021 - by Sebastian Ruder
    (1 min) ML and NLP Research Highlights of 2021 by Sebastian Ruder, actually research scientist at Google in London, ex-DeepMind: Universal Models, Massive Multi-task Learning, Beyond the Transformer (aka cross-attention), Prompting, Efficient Methods, Benchmarking, Conditional Image Generation, ML for Science, Code Synthesis, Bias, Retrieval Augmentation, Token-free Models, Temporal Adaptation, Importance of Data and Meta-learning. submitted by /u/ClaudeCoulombe [link] [comments]
    [D] Cluster Validity Indexes for Fuzzy c-means clustering
    (0 min) What are the good cluster validity indexes for fuzzy c-means clustering? submitted by /u/ManyStomach8923 [link] [comments]
    [Project] AI facial editing models are getting so advanced it will be insanely hard to tell facts from fiction! 🤯🤯(video below: Kamala Harris, Vice President 🇺🇸 smiling when in the actual video she wasn't. In politics, smallest gestures have biggest implications)
    (1 min) AI facial editing models are getting so advanced it will be insanely hard to tell facts from fiction! 🤯🤯(video below: Kamala Harris, Vice President 🇺🇸 smiling when in the actual video she wasn't. In politics, smallest gestures have biggest implications) paper link submitted by /u/cv2020br [link] [comments]
  • Reinforcement Learning

    "MLGO: a Machine Learning Guided Compiler Optimizations Framework", Trofin et al 2022 (tuning LLVM to reduce codesize by 5%)
    (1 min) submitted by /u/gwern [link] [comments]
    Reinforcement learning for images
    (1 min) Can we use reinforcement learning to predict basically the red layer (given green and blue) of certain picture using already provided training and test set( given rgb image) submitted by /u/naam-_karan [link] [comments]
    Where are the Gym's robotic envs ? Discontinued ?
    (1 min) Hi I was trying to get some inspiration on robot env and thus looked at the gym envs, but the code doesnt seem to exist anymore https://gym.openai.com/envs/FetchSlide-v1/ ( see github link) There is a december commit removing the robotic envs. Anyone has a clue why ? and where can I find them ? Also, any recommendation on a good, readable robotics env to use as an production example ? Thanks ! submitted by /u/Ouassimf [link] [comments]
    Binding 2 agents together in NetLogo
    (1 min) Hi, Is there a possibility to bind 2 agents together in NetLogo to make a shape that looks like an H2 molecule ? It has to move and rotate like any other agent as well as to handle collision as a 2 circles part. Dual agents stacking together with 2 circles based collision If it is not possible, I can use Agents.jl in Julia or a Python library. Thanks submitted by /u/x11ry0 [link] [comments]
    Extending MMDRL to IQN
    (1 min) I'm reading the paper Distributional Reinforcement Learning via Moment Matching. In it the authors mention that while their method is based on QRDQN (generating static particles a la quantiles in QRDQN directly from Neural Net), it can be easily extended to IQN and FQF. They mention that instead of generating fixed network outputs, we can generate the particles via applying a neural network function to samples of a base sampling distribution as in IQN. What I don't understand is how we derive the gradients for the sample generating network? The loss for MMDDQN (based off of QRDQN) is done as follows (this is what I understand):given expected particles and target particles, we generate deltas, calculate some kernel function of it, and then calculate the maximum mean discrepancy. See code below batch_size, n_particles = expected_particles.shape assert expected_particles.shape == target_particles.shape delta1 = expected_particles[:, :, None] - expected_particles[:, None, :] delta2 = target_particles[:, :, None] - target_particles[:, None, :] delta3 = expected_particles[:, :, None] - target_particles[:, None, :] first_kernel = huber_loss(delta1, kappa) second_kernel = huber_loss(delta2, kappa) third_kernel = huber_loss(delta3, kappa) for x in (first_kernel, second_kernel, third_kernel): assert x.shape == (batch_size, n_particles, n_particles), x.shape first_item, second_item, third_item = ( kernel_func(x).mean() for x in (first_kernel, second_kernel, third_kernel) ) loss = first_item + second_item - 2 * third_item How do the samples generated from base sampling distribution in IQN (taus) fit in the above loss function? In IQN, the loss is derived from the td_error = expected_quantiles - target_quantiles and the generated taus by combining them using quantile huber loss. For extending MMD to IQN, do we just multiply the taus to first_kernel, second_kernel and third_kernel above and then calculate the kernel function of the results? submitted by /u/SirRantcelot [link] [comments]

2022-01-26

  • cs.LG updates on arXiv.org

    Inferring Hidden Structures in Random Graphs. (arXiv:2110.01901v2 [cs.DS] UPDATED)
    (0 min) We study the two inference problems of detecting and recovering an isolated community of \emph{general} structure planted in a random graph. The detection problem is formalized as a hypothesis testing problem, where under the null hypothesis, the graph is a realization of an Erd\H{o}s-R\'{e}nyi random graph $\mathcal{G}(n,q)$ with edge density $q\in(0,1)$; under the alternative, there is an unknown structure $\Gamma_k$ on $k$ nodes, planted in $\mathcal{G}(n,q)$, such that it appears as an \emph{induced subgraph}. In case of a successful detection, we are concerned with the task of recovering the corresponding structure. For these problems, we investigate the fundamental limits from both the statistical and computational perspectives. Specifically, we derive lower bounds for detecting/recovering the structure $\Gamma_k$ in terms of the parameters $(n,k,q)$, as well as certain properties of $\Gamma_k$, and exhibit computationally unbounded optimal algorithms that achieve these lower bounds. We also consider the problem of testing in polynomial-time. As is customary in many similar structured high-dimensional problems, our model undergoes an "easy-hard-impossible" phase transition and computational constraints can severely penalize the statistical performance. To provide an evidence for this phenomenon, we show that the class of low-degree polynomials algorithms match the statistical performance of the polynomial-time algorithms we develop.
    Consistent Relative Confidence and Label-Free Model Selection for Convolutional Neural Networks. (arXiv:2108.11845v2 [cs.CV] UPDATED)
    (0 min) This letter is concerned with image classification with deep convolutional neural networks (CNNs). The focus is on the following question: given a set of candidate CNN models, how to select the right one with the best generalization property for the current task? Present model selection methods require access to a batch of labeled data for computing a pre-specified performance metric, such as the cross-entropy loss, the classification error rate, the negative log-likelihood. In many practical cases, labels are not available in time as labeling itself is a time-consuming and expensive task. To this end, this letter presents an approach to CNN model selection using only unlabeled data. This method is developed based on a principle termed consistent relative confidence. The effectiveness and efficiency of the proposed method are demonstrated by experiments using benchmark datasets.
    Lifelong 3D Object Recognition and Grasp Synthesis Using Dual Memory Recurrent Self-Organization Networks. (arXiv:2109.11544v2 [cs.RO] UPDATED)
    (0 min) Humans learn to recognize and manipulate new objects in lifelong settings without forgetting the previously gained knowledge under non-stationary and sequential conditions. In autonomous systems, the agents also need to mitigate similar behavior to continually learn the new object categories and adapt to new environments. In most conventional deep neural networks, this is not possible due to the problem of catastrophic forgetting, where the newly gained knowledge overwrites existing representations. Furthermore, most state-of-the-art models excel either in recognizing the objects or in grasp prediction, while both tasks use visual input. The combined architecture to tackle both tasks is very limited. In this paper, we proposed a hybrid model architecture consists of a dynamically growing dual-memory recurrent neural network (GDM) and an autoencoder to tackle object recognition and grasping simultaneously. The autoencoder network is responsible to extract a compact representation for a given object, which serves as input for the GDM learning, and is responsible to predict pixel-wise antipodal grasp configurations. The GDM part is designed to recognize the object in both instances and categories levels. We address the problem of catastrophic forgetting using the intrinsic memory replay, where the episodic memory periodically replays the neural activation trajectories in the absence of external sensory information. To extensively evaluate the proposed model in a lifelong setting, we generate a synthetic dataset due to lack of sequential 3D objects dataset. Experiment results demonstrated that the proposed model can learn both object representation and grasping simultaneously in continual learning scenarios.
    Sinkformers: Transformers with Doubly Stochastic Attention. (arXiv:2110.11773v2 [cs.LG] UPDATED)
    (0 min) Attention based models such as Transformers involve pairwise interactions between data points, modeled with a learnable attention matrix. Importantly, this attention matrix is normalized with the SoftMax operator, which makes it row-wise stochastic. In this paper, we propose instead to use Sinkhorn's algorithm to make attention matrices doubly stochastic. We call the resulting model a Sinkformer. We show that the row-wise stochastic attention matrices in classical Transformers get close to doubly stochastic matrices as the number of epochs increases, justifying the use of Sinkhorn normalization as an informative prior. On the theoretical side, we show that, unlike the SoftMax operation, this normalization makes it possible to understand the iterations of self-attention modules as a discretized gradient-flow for the Wasserstein metric. We also show in the infinite number of samples limit that, when rescaling both attention matrices and depth, Sinkformers operate a heat diffusion. On the experimental side, we show that Sinkformers enhance model accuracy in vision and natural language processing tasks. In particular, on 3D shapes classification, Sinkformers lead to a significant improvement.
    Nonconvex Stochastic Scaled-Gradient Descent and Generalized Eigenvector Problems. (arXiv:2112.14738v2 [stat.ML] UPDATED)
    (0 min) Motivated by the problem of online canonical correlation analysis, we propose the \emph{Stochastic Scaled-Gradient Descent} (SSGD) algorithm for minimizing the expectation of a stochastic function over a generic Riemannian manifold. SSGD generalizes the idea of projected stochastic gradient descent and allows the use of scaled stochastic gradients instead of stochastic gradients. In the special case of a spherical constraint, which arises in generalized eigenvector problems, we establish a nonasymptotic finite-sample bound of $\sqrt{1/T}$, and show that this rate is minimax optimal, up to a polylogarithmic factor of relevant parameters. On the asymptotic side, a novel trajectory-averaging argument allows us to achieve local asymptotic normality with a rate that matches that of Ruppert-Polyak-Juditsky averaging. We bring these ideas together in an application to online canonical correlation analysis, deriving, for the first time in the literature, an optimal one-time-scale algorithm with an explicit rate of local asymptotic convergence to normality. Numerical studies of canonical correlation analysis are also provided for synthetic data.
    Polyak-Ruppert Averaged Q-Leaning is Statistically Efficient. (arXiv:2112.14582v2 [stat.ML] UPDATED)
    (0 min) We study synchronous Q-learning with Polyak-Ruppert averaging (a.k.a., averaged Q-learning) in a $\gamma$-discounted MDP. We establish a functional central limit theorem (FCLT) for the averaged iteration $\bar{\boldsymbol{Q}}_T$ and show its standardized partial-sum process weakly converges to a rescaled Brownian motion. Furthermore, we show that $\bar{\boldsymbol{Q}}_T$ is actually a regular asymptotically linear (RAL) estimator for the optimal Q-value function $\boldsymbol{Q}^*$ with the most efficient influence function. This implies the averaged Q-learning iteration has the smallest asymptotic variance among all RAL estimators. In addition, we present a non-asymptotic analysis for the $\ell_{\infty}$ error $\mathbb{E}\|\bar{\boldsymbol{Q}}_T-\boldsymbol{Q}^*\|_{\infty}$, showing for polynomial step sizes it matches the instance-dependent lower bound as well as the optimal minimax complexity lower bound. In short, our theoretical analysis shows averaged Q-learning is statistically efficient.
    On Well-posedness and Minimax Optimal Rates of Nonparametric Q-function Estimation in Off-policy Evaluation. (arXiv:2201.06169v2 [math.ST] UPDATED)
    (0 min) We study the off-policy evaluation (OPE) problem in an infinite-horizon Markov decision process with continuous states and actions. We recast the $Q$-function estimation into a special form of the nonparametric instrumental variables (NPIV) estimation problem. We first show that under one mild condition the NPIV formulation of $Q$-function estimation is well-posed in the sense of $L^2$-measure of ill-posedness with respect to the data generating distribution, bypassing a strong assumption on the discount factor $\gamma$ imposed in the recent literature for obtaining the $L^2$ convergence rates of various $Q$-function estimators. Thanks to this new well-posed property, we derive the first minimax lower bounds for the convergence rates of nonparametric estimation of $Q$-function and its derivatives in both sup-norm and $L^2$-norm, which are shown to be the same as those for the classical nonparametric regression (Stone, 1982). We then propose a sieve two-stage least squares estimator and establish its rate-optimality in both norms under some mild conditions. Our general results on the well-posedness and the minimax lower bounds are of independent interest to study not only other nonparametric estimators for $Q$-function but also efficient estimation on the value of any target policy in off-policy settings.
    MT-GBM: A Multi-Task Gradient Boosting Machine with Shared Decision Trees. (arXiv:2201.06239v2 [cs.LG] UPDATED)
    (0 min) Despite the success of deep learning in computer vision and natural language processing, Gradient Boosted Decision Tree (GBDT) is yet one of the most powerful tools for applications with tabular data such as e-commerce and FinTech. However, applying GBDT to multi-task learning is still a challenge. Unlike deep models that can jointly learn a shared latent representation across multiple tasks, GBDT can hardly learn a shared tree structure. In this paper, we propose Multi-task Gradient Boosting Machine (MT-GBM), a GBDT-based method for multi-task learning. The MT-GBM can find the shared tree structures and split branches according to multi-task losses. First, it assigns multiple outputs to each leaf node. Next, it computes the gradient corresponding to each output (task). Then, we also propose an algorithm to combine the gradients of all tasks and update the tree. Finally, we apply MT-GBM to LightGBM. Experiments show that our MT-GBM improves the performance of the main task significantly, which means the proposed MT-GBM is efficient and effective.
    MEMO: Test Time Robustness via Adaptation and Augmentation. (arXiv:2110.09506v2 [cs.LG] UPDATED)
    (0 min) While deep neural networks can attain good accuracy on in-distribution test points, many applications require robustness even in the face of unexpected perturbations in the input, changes in the domain, or other sources of distribution shift. We study the problem of test time robustification, i.e., using the test input to improve model robustness. Recent prior works have proposed methods for test time adaptation, however, they each introduce additional assumptions, such as access to multiple test points, that prevent widespread adoption. In this work, we aim to study and devise methods that make no assumptions about the model training process and are broadly applicable at test time. We propose a simple approach that can be used in any test setting where the model is probabilistic and adaptable: when presented with a test example, perform different data augmentations on the data point, and then adapt (all of) the model parameters by minimizing the entropy of the model's average, or marginal, output distribution across the augmentations. Intuitively, this objective encourages the model to make the same prediction across different augmentations, thus enforcing the invariances encoded in these augmentations, while also maintaining confidence in its predictions. In our experiments, we evaluate two baseline ResNet models, two robust ResNet-50 models, and a robust vision transformer model, and we demonstrate that this approach achieves accuracy gains of 1-8\% over standard model evaluation and also generally outperforms prior augmentation and adaptation strategies. For the setting in which only one test point is available, we achieve state-of-the-art results on the ImageNet-C, ImageNet-R, and, among ResNet-50 models, ImageNet-A distribution shift benchmarks.
    Uncertainty-Cognizant Model Predictive Control for Energy Management of Residential Buildings with PVT and Thermal Energy Storage. (arXiv:2201.08909v1 [eess.SY])
    (2 min) The building sector accounts for almost 40 percent of the global energy consumption. This reveals a great opportunity to exploit renewable energy resources in buildings to achieve the climate target. In this context, this paper offers a building energy system embracing a heat pump, a thermal energy storage system along with grid-connected photovoltaic thermal (PVT) collectors to supply both electric and thermal energy demands of the building with minimum operating cost. To this end, the paper develops a stochastic model predictive control (MPC) strategy to optimally determine the set-point of the whole building energy system while accounting for the uncertainties associated with the PVT energy generation. This system enables the building to 1-shift its electric demand from high-peak to off-peak hours and 2- sell electricity to the grid to make energy arbitrage.
    Stochastic normalizing flows as non-equilibrium transformations. (arXiv:2201.08862v1 [hep-lat])
    (2 min) Normalizing flows are a class of deep generative models that provide a promising route to sample lattice field theories more efficiently than conventional Monte~Carlo simulations. In this work we show that the theoretical framework of stochastic normalizing flows, in which neural-network layers are combined with Monte~Carlo updates, is the same that underlies out-of-equilibrium simulations based on Jarzynski's equality, which have been recently deployed to compute free-energy differences in lattice gauge theories. We lay out a strategy to optimize the efficiency of this extended class of generative models and present examples of applications.
    Self-Supervised Representation Learning on Neural Network Weights for Model Characteristic Prediction. (arXiv:2110.15288v4 [cs.LG] UPDATED)
    (0 min) Self-Supervised Learning (SSL) has been shown to learn useful and information-preserving representations. Neural Networks (NNs) are widely applied, yet their weight space is still not fully understood. Therefore, we propose to use SSL to learn hyper-representations of the weights of populations of NNs. To that end, we introduce domain specific data augmentations and an adapted attention architecture. Our empirical evaluation demonstrates that self-supervised representation learning in this domain is able to recover diverse NN model characteristics. Further, we show that the proposed learned representations outperform prior work for predicting hyper-parameters, test accuracy, and generalization gap as well as transfer to out-of-distribution settings.
    Bayesian Graph Contrastive Learning. (arXiv:2112.07823v2 [cs.LG] UPDATED)
    (0 min) Contrastive learning has become a key component of self-supervised learning approaches for graph-structured data. However, despite their success, existing graph contrastive learning methods are incapable of uncertainty quantification for node representations or their downstream tasks, limiting their application in high-stakes domains. In this paper, we propose a novel Bayesian perspective of graph contrastive learning methods showing random augmentations leads to stochastic encoders. As a result, our proposed method represents each node by a distribution in the latent space in contrast to existing techniques which embed each node to a deterministic vector. By learning distributional representations, we provide uncertainty estimates in downstream graph analytics tasks and increase the expressive power of the predictive model. In addition, we propose a Bayesian framework to infer the probability of perturbations in each view of the contrastive model, eliminating the need for a computationally expensive search for hyperparameter tuning. We empirically show a considerable improvement in performance compared to existing state-of-the-art methods on several benchmark datasets.
    Boosting Active Learning via Improving Test Performance. (arXiv:2112.05683v2 [cs.LG] UPDATED)
    (0 min) Central to active learning (AL) is what data should be selected for annotation. Existing works attempt to select highly uncertain or informative data for annotation. Nevertheless, it remains unclear how selected data impacts the test performance of the task model used in AL. In this work, we explore such an impact by theoretically proving that selecting unlabeled data of higher gradient norm leads to a lower upper-bound of test loss, resulting in better test performance. However, due to the lack of label information, directly computing gradient norm for unlabeled data is infeasible. To address this challenge, we propose two schemes, namely expected-gradnorm and entropy-gradnorm. The former computes the gradient norm by constructing an expected empirical loss while the latter constructs an unsupervised loss with entropy. Furthermore, we integrate the two schemes in a universal AL framework. We evaluate our method on classical image classification and semantic segmentation tasks. To demonstrate its competency in domain applications and its robustness to noise, we also validate our method on a cellular imaging analysis task, namely cryo-Electron Tomography subtomogram classification. Results demonstrate that our method achieves superior performance against the state of the art. Our source code is available at https://github.com/xulabs/aitom/blob/master/doc/projects/al_gradnorm.md.
    Early Myocardial Infarction Detection over Multi-view Echocardiography. (arXiv:2111.05790v2 [eess.IV] UPDATED)
    (0 min) Myocardial infarction (MI) is the leading cause of mortality in the world that occurs due to a blockage of the coronary arteries feeding the myocardium. An early diagnosis of MI and its localization can mitigate the extent of myocardial damage by facilitating early therapeutic interventions. Following the blockage of a coronary artery, the regional wall motion abnormality (RWMA) of the ischemic myocardial segments is the earliest change to set in. Echocardiography is the fundamental tool to assess any RWMA. Assessing the motion of the left ventricle (LV) wall only from a single echocardiography view may lead to missing the diagnosis of MI as the RWMA may not be visible on that specific view. Therefore, in this study, we propose to fuse apical 4-chamber (A4C) and apical 2-chamber (A2C) views in which a total of 12 myocardial segments can be analyzed for MI detection. The proposed method first estimates the motion of the LV wall by Active Polynomials (APs), which extract and track the endocardial boundary to compute myocardial segment displacements. The features are extracted from the A4C and A2C view displacements, which are concatenated and fed into the classifiers to detect MI. The main contributions of this study are 1) creation of a new benchmark dataset by including both A4C and A2C views in a total of 260 echocardiography recordings, which is publicly shared with the research community, 2) improving the performance of the prior work of threshold-based APs by a Machine Learning based approach, and 3) a pioneer MI detection approach via multi-view echocardiography by fusing the information of A4C and A2C views. Experimental results show that the proposed method achieves 90.91% sensitivity and 86.36% precision for MI detection over multi-view echocardiography.
    End-to-End Complex-Valued Multidilated Convolutional Neural Network for Joint Acoustic Echo Cancellation and Noise Suppression. (arXiv:2110.00745v3 [eess.AS] UPDATED)
    (0 min) Echo and noise suppression is an integral part of a full-duplex communication system. Many recent acoustic echo cancellation (AEC) systems rely on a separate adaptive filtering module for linear echo suppression and a neural module for residual echo suppression. However, not only do adaptive filtering modules require convergence and remain susceptible to changes in acoustic environments, but this two-stage framework also often introduces unnecessary delays to the AEC system when neural modules are already capable of both linear and nonlinear echo suppression. In this paper, we exploit the offset-compensating ability of complex time-frequency masks and propose an end-to-end complex-valued neural network architecture. The building block of the proposed model is a pseudocomplex extension based on the densely-connected multidilated DenseNet (D3Net) building block, resulting in a very small network of only 354K parameters. The architecture utilized the multi-resolution nature of the D3Net building blocks to eliminate the need for pooling, allowing the network to extract features using large receptive fields without any loss of output resolution. We also propose a dual-mask technique for joint echo and noise suppression with simultaneous speech enhancement. Evaluation on both synthetic and real test sets demonstrated promising results across multiple energy-based metrics and perceptual proxies.
    Towards Private Learning on Decentralized Graphs with Local Differential Privacy. (arXiv:2201.09398v1 [cs.LG])
    (0 min) Many real-world networks are inherently decentralized. For example, in social networks, each user maintains a local view of a social graph, such as a list of friends and her profile. It is typical to collect these local views of social graphs and conduct graph learning tasks. However, learning over graphs can raise privacy concerns as these local views often contain sensitive information. In this paper, we seek to ensure private graph learning on a decentralized network graph. Towards this objective, we propose {\em Solitude}, a new privacy-preserving learning framework based on graph neural networks (GNNs), with formal privacy guarantees based on edge local differential privacy. The crux of {\em Solitude} is a set of new delicate mechanisms that can calibrate the introduced noise in the decentralized graph collected from the users. The principle behind the calibration is the intrinsic properties shared by many real-world graphs, such as sparsity. Unlike existing work on locally private GNNs, our new framework can simultaneously protect node feature privacy and edge privacy, and can seamlessly incorporate with any GNN with privacy-utility guarantees. Extensive experiments on benchmarking datasets show that {\em Solitude} can retain the generalization capability of the learned GNN while preserving the users' data privacy under given privacy budgets.
    Optimization-Based Separations for Neural Networks. (arXiv:2112.02393v2 [cs.LG] UPDATED)
    (0 min) Depth separation results propose a possible theoretical explanation for the benefits of deep neural networks over shallower architectures, establishing that the former possess superior approximation capabilities. However, there are no known results in which the deeper architecture leverages this advantage into a provable optimization guarantee. We prove that when the data are generated by a distribution with radial symmetry which satisfies some mild assumptions, gradient descent can efficiently learn ball indicator functions using a depth 2 neural network with two layers of sigmoidal activations, and where the hidden layer is held fixed throughout training. By building on and refining existing techniques for approximation lower bounds of neural networks with a single layer of non-linearities, we show that there are $d$-dimensional radial distributions on the data such that ball indicators cannot be learned efficiently by any algorithm to accuracy better than $\Omega(d^{-4})$, nor by a standard gradient descent implementation to accuracy better than a constant. These results establish what is to the best of our knowledge, the first optimization-based separations where the approximation benefits of the stronger architecture provably manifest in practice. Our proof technique introduces new tools and ideas that may be of independent interest in the theoretical study of both the approximation and optimization of neural networks.
    System Norm Regularization Methods for Koopman Operator Approximation. (arXiv:2110.09658v2 [eess.SY] UPDATED)
    (0 min) Approximating the Koopman operator from data is numerically challenging when many lifting functions are considered. Even low-dimensional systems can yield unstable or ill-conditioned results in a high-dimensional lifted space. In this paper, Extended Dynamic Mode Decomposition (DMD) and DMD with control, two popular methods for approximating the Koopman operator, are reformulated as convex optimization problems with linear matrix inequality constraints. Hard asymptotic stability constraints and system norm regularizers are considered as methods to improve the numerical conditioning of the approximate Koopman operator. In particular, the H-infinity norm is used as a regularizer to penalize the input-output gain of the linear system defined by the Koopman operator. Weighting functions are then applied to penalize the system gain at specific frequencies. Experimental results using data from an aircraft fatigue structural test rig and a soft robot arm highlight the advantages of the proposed regression methods.
    ULSA: Unified Language of Synthesis Actions for Representation of Synthesis Protocols. (arXiv:2201.09329v1 [cs.LG])
    (0 min) Applying AI power to predict syntheses of novel materials requires high-quality, large-scale datasets. Extraction of synthesis information from scientific publications is still challenging, especially for extracting synthesis actions, because of the lack of a comprehensive labeled dataset using a solid, robust, and well-established ontology for describing synthesis procedures. In this work, we propose the first Unified Language of Synthesis Actions (ULSA) for describing ceramics synthesis procedures. We created a dataset of 3,040 synthesis procedures annotated by domain experts according to the proposed ULSA scheme. To demonstrate the capabilities of ULSA, we built a neural network-based model to map arbitrary ceramics synthesis paragraphs into ULSA and used it to construct synthesis flowcharts for synthesis procedures. Analysis for the flowcharts showed that (a) ULSA covers essential vocabulary used by researchers when describing synthesis procedures and (b) it can capture important features of synthesis protocols. This work is an important step towards creating a synthesis ontology and a solid foundation for autonomous robotic synthesis.
    Transformers Generalize DeepSets and Can be Extended to Graphs and Hypergraphs. (arXiv:2110.14416v2 [cs.LG] UPDATED)
    (0 min) We present a generalization of Transformers to any-order permutation invariant data (sets, graphs, and hypergraphs). We begin by observing that Transformers generalize DeepSets, or first-order (set-input) permutation invariant MLPs. Then, based on recently characterized higher-order invariant MLPs, we extend the concept of self-attention to higher orders and propose higher-order Transformers for order-$k$ data ($k=2$ for graphs and $k>2$ for hypergraphs). Unfortunately, higher-order Transformers turn out to have prohibitive complexity $\mathcal{O}(n^{2k})$ to the number of input nodes $n$. To address this problem, we present sparse higher-order Transformers that have quadratic complexity to the number of input hyperedges, and further adopt the kernel attention approach to reduce the complexity to linear. In particular, we show that the sparse second-order Transformers with kernel attention are theoretically more expressive than message passing operations while having an asymptotically identical complexity. Our models achieve significant performance improvement over invariant MLPs and message-passing graph neural networks in large-scale graph regression and set-to-(hyper)graph prediction tasks. Our implementation is available at https://github.com/jw9730/hot.
    Time Discretization-Invariant Safe Action Repetition for Policy Gradient Methods. (arXiv:2111.03941v5 [cs.LG] UPDATED)
    (0 min) In reinforcement learning, continuous time is often discretized by a time scale $\delta$, to which the resulting performance is known to be highly sensitive. In this work, we seek to find a $\delta$-invariant algorithm for policy gradient (PG) methods, which performs well regardless of the value of $\delta$. We first identify the underlying reasons that cause PG methods to fail as $\delta \to 0$, proving that the variance of the PG estimator can diverge to infinity in stochastic environments under a certain assumption of stochasticity. While durative actions or action repetition can be employed to have $\delta$-invariance, previous action repetition methods cannot immediately react to unexpected situations in stochastic environments. We thus propose a novel $\delta$-invariant method named Safe Action Repetition (SAR) applicable to any existing PG algorithm. SAR can handle the stochasticity of environments by adaptively reacting to changes in states during action repetition. We empirically show that our method is not only $\delta$-invariant but also robust to stochasticity, outperforming previous $\delta$-invariant approaches on eight MuJoCo environments with both deterministic and stochastic settings. Our code is available at https://vision.snu.ac.kr/projects/sar.
    Lensing Machines: Representing Perspective in Latent Variable Models. (arXiv:2201.08848v1 [cs.LG])
    (2 min) Many datasets represent a combination of different ways of looking at the same data that lead to different generalizations. For example, a corpus with examples generated by different people may be mixtures of many perspectives and can be viewed with different perspectives by others. It isnt always possible to represent the viewpoints by a clean separation, in advance, of examples representing each viewpoint and train a separate model for each viewpoint. We introduce lensing, a mixed initiative technique to extract lenses or mappings between machine learned representations and perspectives of human experts, and to generate lensed models that afford multiple perspectives of the same dataset. We apply lensing for two classes of latent variable models: a mixed membership model, a matrix factorization model in the context of two mental health applications, and we capture and imbue the perspectives of clinical psychologists into these models. Our work shows the benefits of the machine learning practitioner formally incorporating the perspective of a knowledgeable domain expert into their models rather than estimating unlensed models themselves in isolation.
    Differentiable Scaffolding Tree for Molecular Optimization. (arXiv:2109.10469v2 [cs.LG] UPDATED)
    (0 min) The structural design of functional molecules, also called molecular optimization, is an essential chemical science and engineering task with important applications, such as drug discovery. Deep generative models and combinatorial optimization methods achieve initial success but still struggle with directly modeling discrete chemical structures and often heavily rely on brute-force enumeration. The challenge comes from the discrete and non-differentiable nature of molecule structures. To address this, we propose differentiable scaffolding tree (DST) that utilizes a learned knowledge network to convert discrete chemical structures to locally differentiable ones. DST enables a gradient-based optimization on a chemical graph structure by back-propagating the derivatives from the target properties through a graph neural network (GNN). Our empirical studies show the gradient-based molecular optimizations are both effective and sample efficient. Furthermore, the learned graph parameters can also provide an explanation that helps domain experts understand the model output.
    Probability Distribution on Full Rooted Trees. (arXiv:2109.12825v4 [stat.ML] UPDATED)
    (0 min) The recursive and hierarchical structure of full rooted trees is applicable to represent statistical models in various areas, such as data compression, image processing, and machine learning. In most of these cases, the full rooted tree is not a random variable; as such, model selection to avoid overfitting becomes problematic. A method to solve this problem is to assume a prior distribution on the full rooted trees. This enables the optimal model selection based on the Bayes decision theory. For example, by assigning a low prior probability to a complex model, the maximum a posteriori estimator prevents the selection of the complex one. Furthermore, we can average all the models weighted by their posteriors. In this paper, we propose a probability distribution on a set of full rooted trees. Its parametric representation is suitable for calculating the properties of our distribution using recursive functions, such as the mode, expectation, and posterior distribution. Although such distributions have been proposed in previous studies, they are only applicable to specific applications. Therefore, we extract their mathematically essential components and derive new generalized methods to calculate the expectation, posterior distribution, etc.
    Convergence of Deep Convolutional Neural Networks. (arXiv:2109.13542v2 [cs.LG] UPDATED)
    (0 min) Convergence of deep neural networks as the depth of the networks tends to infinity is fundamental in building the mathematical foundation for deep learning. In a previous study, we investigated this question for deep ReLU networks with a fixed width. This does not cover the important convolutional neural networks where the widths are increasing from layer to layer. For this reason, we first study convergence of general ReLU networks with increasing widths and then apply the results obtained to deep convolutional neural networks. It turns out the convergence reduces to convergence of infinite products of matrices with increasing sizes, which has not been considered in the literature. We establish sufficient conditions for convergence of such infinite products of matrices. Based on the conditions, we present sufficient conditions for piecewise convergence of general deep ReLU networks with increasing widths, and as well as pointwise convergence of deep ReLU convolutional neural networks.
    In-Bed Human Pose Estimation from Unseen and Privacy-Preserving Image Domains. (arXiv:2111.15124v2 [cs.CV] UPDATED)
    (0 min) Medical applications have benefited greatly from the rapid advancement in computer vision. Considering patient monitoring in particular, in-bed human posture estimation offers important health-related metrics with potential value in medical condition assessments. Despite great progress in this domain, it remains challenging due to substantial ambiguity during occlusions, and the lack of large corpora of manually labeled data for model training, particularly with domains such as thermal infrared imaging which are privacy-preserving, and thus of great interest. Motivated by the effectiveness of self-supervised methods in learning features directly from data, we propose a multi-modal conditional variational autoencoder (MC-VAE) capable of reconstructing features from missing modalities seen during training. This approach is used with HRNet to enable single modality inference for in-bed pose estimation. Through extensive evaluations, we demonstrate that body positions can be effectively recognized from the available modality, achieving on par results with baseline models that are highly dependent on having access to multiple modes at inference time. The proposed framework supports future research towards self-supervised learning that generates a robust model from a single source, and expects it to generalize over many unknown distributions in clinical environments.
    Generative Adversarial Network Applications in Creating a Meta-Universe. (arXiv:2201.09152v1 [cs.CV])
    (0 min) Generative Adversarial Networks (GANs) are machine learning methods that are used in many important and novel applications. For example, in imaging science, GANs are effectively utilized in generating image datasets, photographs of human faces, image and video captioning, image-to-image translation, text-to-image translation, video prediction, and 3D object generation to name a few. In this paper, we discuss how GANs can be used to create an artificial world. More specifically, we discuss how GANs help to describe an image utilizing image/video captioning methods and how to translate the image to a new image using image-to-image translation frameworks in a theme we desire. We articulate how GANs impact creating a customized world.
    On Effective Scheduling of Model-based Reinforcement Learning. (arXiv:2111.08550v2 [cs.LG] UPDATED)
    (0 min) Model-based reinforcement learning has attracted wide attention due to its superior sample efficiency. Despite its impressive success so far, it is still unclear how to appropriately schedule the important hyperparameters to achieve adequate performance, such as the real data ratio for policy optimization in Dyna-style model-based algorithms. In this paper, we first theoretically analyze the role of real data in policy training, which suggests that gradually increasing the ratio of real data yields better performance. Inspired by the analysis, we propose a framework named AutoMBPO to automatically schedule the real data ratio as well as other hyperparameters in training model-based policy optimization (MBPO) algorithm, a representative running case of model-based methods. On several continuous control tasks, the MBPO instance trained with hyperparameters scheduled by AutoMBPO can significantly surpass the original one, and the real data ratio schedule found by AutoMBPO shows consistency with our theoretical analysis.
    A GNN-RNN Approach for Harnessing Geospatial and Temporal Information: Application to Crop Yield Prediction. (arXiv:2111.08900v2 [cs.LG] UPDATED)
    (0 min) Climate change is posing new challenges to crop-related concerns including food insecurity, supply stability and economic planning. As one of the central challenges, crop yield prediction has become a pressing task in the machine learning field. Despite its importance, the prediction task is exceptionally complicated since crop yields depend on various factors such as weather, land surface, soil quality as well as their interactions. In recent years, machine learning models have been successfully applied in this domain. However, these models either restrict their tasks to a relatively small region, or only study over a single or few years, which makes them hard to generalize spatially and temporally. In this paper, we introduce a novel graph-based recurrent neural network for crop yield prediction, to incorporate both geographical and temporal knowledge in the model, and further boost predictive power. Our method is trained, validated, and tested on over 2000 counties from 41 states in the US mainland, covering years from 1981 to 2019. As far as we know, this is the first machine learning method that embeds geographical knowledge in crop yield prediction and predicts the crop yields at county level nationwide. We also laid a solid foundation for the comparison with other machine learning baselines by applying well-known linear models, tree-based models, deep learning methods and comparing their performance. Experiments show that our proposed method consistently outperforms the existing state-of-the-art methods on various metrics, validating the effectiveness of geospatial and temporal information.
    Adversarial Unlearning of Backdoors via Implicit Hypergradient. (arXiv:2110.03735v3 [cs.LG] UPDATED)
    (0 min) We propose a minimax formulation for removing backdoors from a given poisoned model based on a small set of clean data. This formulation encompasses much of prior work on backdoor removal. We propose the Implicit Bacdoor Adversarial Unlearning (I-BAU) algorithm to solve the minimax. Unlike previous work, which breaks down the minimax into separate inner and outer problems, our algorithm utilizes the implicit hypergradient to account for the interdependence between inner and outer optimization. We theoretically analyze its convergence and the generalizability of the robustness gained by solving minimax on clean data to unseen test data. In our evaluation, we compare I-BAU with six state-of-art backdoor defenses on seven backdoor attacks over two datasets and various attack settings, including the common setting where the attacker targets one class as well as important but underexplored settings where multiple classes are targeted. I-BAU's performance is comparable to and most often significantly better than the best baseline. Particularly, its performance is more robust to the variation on triggers, attack settings, poison ratio, and clean data size. Moreover, I-BAU requires less computation to take effect; particularly, it is more than $13\times$ faster than the most efficient baseline in the single-target attack setting. Furthermore, it can remain effective in the extreme case where the defender can only access 100 clean samples -- a setting where all the baselines fail to produce acceptable results.
    SurvITE: Learning Heterogeneous Treatment Effects from Time-to-Event Data. (arXiv:2110.14001v2 [cs.LG] UPDATED)
    (0 min) We study the problem of inferring heterogeneous treatment effects from time-to-event data. While both the related problems of (i) estimating treatment effects for binary or continuous outcomes and (ii) predicting survival outcomes have been well studied in the recent machine learning literature, their combination -- albeit of high practical relevance -- has received considerably less attention. With the ultimate goal of reliably estimating the effects of treatments on instantaneous risk and survival probabilities, we focus on the problem of learning (discrete-time) treatment-specific conditional hazard functions. We find that unique challenges arise in this context due to a variety of covariate shift issues that go beyond a mere combination of well-studied confounding and censoring biases. We theoretically analyse their effects by adapting recent generalization bounds from domain adaptation and treatment effect estimation to our setting and discuss implications for model design. We use the resulting insights to propose a novel deep learning method for treatment-specific hazard estimation based on balancing representations. We investigate performance across a range of experimental settings and empirically confirm that our method outperforms baselines by addressing covariate shifts from various sources.
    Learning PSD-valued functions using kernel sums-of-squares. (arXiv:2111.11306v2 [stat.ML] UPDATED)
    (0 min) Shape constraints such as positive semi-definiteness (PSD) for matrices or convexity for functions play a central role in many applications in machine learning and sciences, including metric learning, optimal transport, and economics. Yet, very few function models exist that enforce PSD-ness or convexity with good empirical performance and theoretical guarantees. In this paper, we introduce a kernel sum-of-squares model for functions that take values in the PSD cone, which extends kernel sums-of-squares models that were recently proposed to encode non-negative scalar functions. We provide a representer theorem for this class of PSD functions, show that it constitutes a universal approximator of PSD functions, and derive eigenvalue bounds in the case of subsampled equality constraints. We then apply our results to modeling convex functions, by enforcing a kernel sum-of-squares representation of their Hessian, and show that any smooth and strongly convex function may be thus represented. Finally, we illustrate our methods on a PSD matrix-valued regression task, and on scalar-valued convex regression.
    Decentralized Stochastic Proximal Gradient Descent with Variance Reduction over Time-varying Networks. (arXiv:2112.10389v2 [cs.LG] UPDATED)
    (0 min) In decentralized learning, a network of nodes cooperate to minimize an overall objective function that is usually the finite-sum of their local objectives, and incorporates a non-smooth regularization term for the better generalization ability. Decentralized stochastic proximal gradient (DSPG) method is commonly used to train this type of learning models, while the convergence rate is retarded by the variance of stochastic gradients. In this paper, we propose a novel algorithm, namely DPSVRG, to accelerate the decentralized training by leveraging the variance reduction technique. The basic idea is to introduce an estimator in each node, which tracks the local full gradient periodically, to correct the stochastic gradient at each iteration. By transforming our decentralized algorithm into a centralized inexact proximal gradient algorithm with variance reduction, and controlling the bounds of error sequences, we prove that DPSVRG converges at the rate of $O(1/T)$ for general convex objectives plus a non-smooth term with $T$ as the number of iterations, while DSPG converges at the rate $O(\frac{1}{\sqrt{T}})$. Our experiments on different applications, network topologies and learning models demonstrate that DPSVRG converges much faster than DSPG, and the loss function of DPSVRG decreases smoothly along with the training epochs.
    Discovering Parametric Activation Functions. (arXiv:2006.03179v5 [cs.LG] UPDATED)
    (0 min) Recent studies have shown that the choice of activation function can significantly affect the performance of deep learning networks. However, the benefits of novel activation functions have been inconsistent and task dependent, and therefore the rectified linear unit (ReLU) is still the most commonly used. This paper proposes a technique for customizing activation functions automatically, resulting in reliable improvements in performance. Evolutionary search is used to discover the general form of the function, and gradient descent to optimize its parameters for different parts of the network and over the learning process. Experiments with four different neural network architectures on the CIFAR-10 and CIFAR-100 image classification datasets show that this approach is effective. It discovers both general activation functions and specialized functions for different architectures, consistently improving accuracy over ReLU and other activation functions by significant margins. The approach can therefore be used as an automated optimization step in applying deep learning to new tasks.
    KalmanNet: Neural Network Aided Kalman Filtering for Partially Known Dynamics. (arXiv:2107.10043v2 [eess.SP] UPDATED)
    (0 min) State estimation of dynamical systems in real-time is a fundamental task in signal processing. For systems that are well-represented by a fully known linear Gaussian state space (SS) model, the celebrated Kalman filter (KF) is a low complexity optimal solution. However, both linearity of the underlying SS model and accurate knowledge of it are often not encountered in practice. Here, we present KalmanNet, a real-time state estimator that learns from data to carry out Kalman filtering under non-linear dynamics with partial information. By incorporating the structural SS model with a dedicated recurrent neural network module in the flow of the KF, we retain data efficiency and interpretability of the classic algorithm while implicitly learning complex dynamics from data. We numerically demonstrate that KalmanNet overcomes non-linearities and model mismatch, outperforming classic filtering methods operating with both mismatched and accurate domain knowledge.
    The Role of Lookahead and Approximate Policy Evaluation in Reinforcement Learning with Linear Value Function Approximation. (arXiv:2109.13419v4 [cs.LG] UPDATED)
    (0 min) Reinforcement learning uses a number of techniques to learn a near-optimal optimal policy for very large MDPs by approximately solving the dynamic programming problem, including lookahead, approximate policy evaluation using an m-step return, and function approximation. In a recent paper, (Efroni et al. 2019) studied the impact of lookahead on the convergence rate of approximate dynamic programming. In this paper, we show that these convergence results change dramatically when function approximation is used in conjunction with lookout and approximate policy evaluation using an m-step return. Specifically, we show that when linear function approximation is used to represent the value function, a certain minimum amount of lookahead and multi-step return is needed for the algorithm to even converge. And when this condition is met, we characterize the performance of policies obtained using such approximate policy iteration. Our results are presented for two different procedures to compute the function approximation: linear least-squares regression and gradient descent.
    Space-Time Graph Neural Networks. (arXiv:2110.02880v2 [cs.LG] UPDATED)
    (0 min) We introduce space-time graph neural network (ST-GNN), a novel GNN architecture, tailored to jointly process the underlying space-time topology of time-varying network data. The cornerstone of our proposed architecture is the composition of time and graph convolutional filters followed by pointwise nonlinear activation functions. We introduce a generic definition of convolution operators that mimic the diffusion process of signals over its underlying support. On top of this definition, we propose space-time graph convolutions that are built upon a composition of time and graph shift operators. We prove that ST-GNNs with multivariate integral Lipschitz filters are stable to small perturbations in the underlying graphs as well as small perturbations in the time domain caused by time warping. Our analysis shows that small variations in the network topology and time evolution of a system does not significantly affect the performance of ST-GNNs. Numerical experiments with decentralized control systems showcase the effectiveness and stability of the proposed ST-GNNs.
    Implicit Bias of Projected Subgradient Method Gives Provable Robust Recovery of Subspaces of Unknown Codimension. (arXiv:2201.09079v1 [cs.CV])
    (0 min) Robust subspace recovery (RSR) is a fundamental problem in robust representation learning. Here we focus on a recently proposed RSR method termed Dual Principal Component Pursuit (DPCP) approach, which aims to recover a basis of the orthogonal complement of the subspace and is amenable to handling subspaces of high relative dimension. Prior work has shown that DPCP can provably recover the correct subspace in the presence of outliers, as long as the true dimension of the subspace is known. We show that DPCP can provably solve RSR problems in the {\it unknown} subspace dimension regime, as long as orthogonality constraints -- adopted in previous DPCP formulations -- are relaxed and random initialization is used instead of spectral one. Namely, we propose a very simple algorithm based on running multiple instances of a projected sub-gradient descent method (PSGM), with each problem instance seeking to find one vector in the null space of the subspace. We theoretically prove that under mild conditions this approach will succeed with high probability. In particular, we show that 1) all of the problem instances will converge to a vector in the nullspace of the subspace and 2) the ensemble of problem instance solutions will be sufficiently diverse to fully span the nullspace of the subspace thus also revealing its true unknown codimension. We provide empirical results that corroborate our theoretical results and showcase the remarkable implicit rank regularization behavior of PSGM algorithm that allows us to perform RSR without being aware of the subspace dimension.
    Core Challenges in Embodied Vision-Language Planning. (arXiv:2106.13948v3 [cs.LG] UPDATED)
    (0 min) Recent advances in the areas of multimodal machine learning and artificial intelligence (AI) have led to the development of challenging tasks at the intersection of Computer Vision, Natural Language Processing, and Embodied AI. Whereas many approaches and previous survey pursuits have characterised one or two of these dimensions, there has not been a holistic analysis at the center of all three. Moreover, even when combinations of these topics are considered, more focus is placed on describing, e.g., current architectural methods, as opposed to also illustrating high-level challenges and opportunities for the field. In this survey paper, we discuss Embodied Vision-Language Planning (EVLP) tasks, a family of prominent embodied navigation and manipulation problems that jointly use computer vision and natural language. We propose a taxonomy to unify these tasks and provide an in-depth analysis and comparison of the new and current algorithmic approaches, metrics, simulated environments, as well as the datasets used for EVLP tasks. Finally, we present the core challenges that we believe new EVLP works should seek to address, and we advocate for task construction that enables model generalizability and furthers real-world deployment.
    Random Offset Block Embedding Array (ROBE) for CriteoTB Benchmark MLPerf DLRM Model : 1000$\times$ Compression and 3.1$\times$ Faster Inference. (arXiv:2108.02191v2 [cs.IR] UPDATED)
    (0 min) Deep learning for recommendation data is one of the most pervasive and challenging AI workload in recent times. State-of-the-art recommendation models are one of the largest models matching the likes of GPT-3 and Switch Transformer. Challenges in deep learning recommendation models (DLRM) stem from learning dense embeddings for each of the categorical tokens. These embedding tables in industrial scale models can be as large as hundreds of terabytes. Such large models lead to a plethora of engineering challenges, not to mention prohibitive communication overheads, and slower training and inference times. Of these, slower inference time directly impacts user experience. Model compression for DLRM is gaining traction and the community has recently shown impressive compression results. In this paper, we present Random Offset Block Embedding Array (ROBE) as a low memory alternative to embedding tables which provide orders of magnitude reduction in memory usage while maintaining accuracy and boosting execution speed. ROBE is a simple fundamental approach in improving both cache performance and the variance of randomized hashing, which could be of independent interest in itself. We demonstrate that we can successfully train DLRM models with same accuracy while using $1000 \times$ less memory. A $1000\times$ compressed model directly results in faster inference without any engineering effort. In particular, we show that we can train DLRM model using ROBE array of size 100MB on a single GPU to achieve AUC of 0.8025 or higher as required by official MLPerf CriteoTB benchmark DLRM model of 100GB while achieving about $3.1\times$ (209\%) improvement in inference throughput.
    Fast and Efficient MMD-based Fair PCA via Optimization over Stiefel Manifold. (arXiv:2109.11196v3 [stat.ML] UPDATED)
    (0 min) This paper defines fair principal component analysis (PCA) as minimizing the maximum mean discrepancy (MMD) between dimensionality-reduced conditional distributions of different protected classes. The incorporation of MMD naturally leads to an exact and tractable mathematical formulation of fairness with good statistical properties. We formulate the problem of fair PCA subject to MMD constraints as a non-convex optimization over the Stiefel manifold and solve it using the Riemannian Exact Penalty Method with Smoothing (REPMS; Liu and Boumal, 2019). Importantly, we provide local optimality guarantees and explicitly show the theoretical effect of each hyperparameter in practical settings, extending previous results. Experimental comparisons based on synthetic and UCI datasets show that our approach outperforms prior work in explained variance, fairness, and runtime.
    Fisher Task Distance and Its Applications in Neural Architecture Search and Transfer Learning. (arXiv:2103.12827v4 [cs.LG] UPDATED)
    (0 min) We formulate an asymmetric (or non-commutative) distance between tasks based on FisherInformation Matrices. We provide a proof of consistency for our distance through theorems and experiments on various classification tasks. We then apply our proposed measure of task distance in transfer learning on visual tasks in the Taskonomy dataset. Additionally, we show how the proposed distance between a target task and a set of baseline tasks can be used to reduce the neural architecture search space for the target task. The complexity reduction in search space for task-specific architectures is achieved by building on the optimized architectures for similar tasks instead of doing a full search and without using this side information. Experimental results demonstrate the efficacy of the proposed approach and its improvements over other methods.
    Rethinking the Backdoor Attacks' Triggers: A Frequency Perspective. (arXiv:2104.03413v3 [cs.LG] UPDATED)
    (0 min) Backdoor attacks have been considered a severe security threat to deep learning. Such attacks can make models perform abnormally on inputs with predefined triggers and still retain state-of-the-art performance on clean data. While backdoor attacks have been thoroughly investigated in the image domain from both attackers' and defenders' sides, an analysis in the frequency domain has been missing thus far. This paper first revisits existing backdoor triggers from a frequency perspective and performs a comprehensive analysis. Our results show that many current backdoor attacks exhibit severe high-frequency artifacts, which persist across different datasets and resolutions. We further demonstrate these high-frequency artifacts enable a simple way to detect existing backdoor triggers at a detection rate of 98.50% without prior knowledge of the attack details and the target model. Acknowledging previous attacks' weaknesses, we propose a practical way to create smooth backdoor triggers without high-frequency artifacts and study their detectability. We show that existing defense works can benefit by incorporating these smooth triggers into their design consideration. Moreover, we show that the detector tuned over stronger smooth triggers can generalize well to unseen weak smooth triggers. In short, our work emphasizes the importance of considering frequency analysis when designing both backdoor attacks and defenses in deep learning.
    Bottom-Up Skill Discovery from Unsegmented Demonstrations for Long-Horizon Robot Manipulation. (arXiv:2109.13841v2 [cs.RO] UPDATED)
    (0 min) We tackle real-world long-horizon robot manipulation tasks through skill discovery. We present a bottom-up approach to learning a library of reusable skills from unsegmented demonstrations and use these skills to synthesize prolonged robot behaviors. Our method starts with constructing a hierarchical task structure from each demonstration through agglomerative clustering. From the task structures of multi-task demonstrations, we identify skills based on the recurring patterns and train goal-conditioned sensorimotor policies with hierarchical imitation learning. Finally, we train a meta controller to compose these skills to solve long-horizon manipulation tasks. The entire model can be trained on a small set of human demonstrations collected within 30 minutes without further annotations, making it amendable to real-world deployment. We systematically evaluated our method in simulation environments and on a real robot. Our method has shown superior performance over state-of-the-art imitation learning methods in multi-stage manipulation tasks. Furthermore, skills discovered from multi-task demonstrations boost the average task success by $8\%$ compared to those discovered from individual tasks.
    A Direct Slip Ratio Estimation Method based on an Intelligent Tire and Machine Learning. (arXiv:2106.08961v3 [cs.LG] UPDATED)
    (0 min) Accurate estimation of the tire slip ratio is critical for vehicle safety, as it is necessary for vehicle control purposes. In this paper, an intelligent tire system is presented to develop a novel slip ratio estimation model using machine learning algorithms. The accelerations, generated by a triaxial accelerometer installed onto the inner liner of the tire, are varied when the tire rotates to update the contact patch. Meanwhile, the slip ratio reference value can be measured by the MTS Flat-Trac tire test platform. Then, by analyzing the variation between the accelerations and slip ratio, highly useful features are discovered, which are especially promising for assessing vertical acceleration. For these features, machine learning (ML) algorithms are trained to build the slip ratio estimation model, in which the ML algorithms include artificial neural networks (ANNs), gradient boosting machines (GBMs), random forests (RFs), and support vector machines (SVMs). Finally, the estimated NRMS errors are evaluated using 10-fold cross-validation (CV). The proposed estimation model is able to estimate the slip ratio continuously and stably using only the acceleration from the intelligent tire system, and the estimated slip ratio range can reach 30%. The estimation results have high robustness to vehicle velocity and load, where the best NRMS errors can reach 4.88%. In summary, the present study with the fusion of an intelligent tire system and machine learning paves the way for the accurate estimation of the tire slip ratio under different driving conditions, which create new opportunities for autonomous vehicles, intelligent tires, and tire slip ratio estimation.
    Construction Cost Index Forecasting: A Multi-feature Fusion Approach. (arXiv:2108.10155v3 [cs.LG] UPDATED)
    (0 min) The construction cost index is an important indicator of the construction industry. Predicting CCI has important practical significance. This paper combines information fusion with machine learning, and proposes a multi-feature fusion (MFF) module for time series forecasting. Compared with the convolution module, the MFF module is a module that extracts certain features. Experiments have proved that the combination of MFF module and multi-layer perceptron has a relatively good prediction effect. The MFF neural network model has high prediction accuracy and efficient prediction efficiency. At the same time, MFF continues to improve the potential of prediction accuracy, which is a study of continuous attention.
    Approximate Latent Force Model Inference. (arXiv:2109.11851v3 [cs.LG] UPDATED)
    (0 min) Physically-inspired latent force models offer an interpretable alternative to purely data driven tools for inference in dynamical systems. They carry the structure of differential equations and the flexibility of Gaussian processes, yielding interpretable parameters and dynamics-imposed latent functions. However, the existing inference techniques associated with these models rely on the exact computation of posterior kernel terms which are seldom available in analytical form. Most applications relevant to practitioners, such as Hill equations or diffusion equations, are hence intractable. In this paper, we overcome these computational problems by proposing a variational solution to a general class of non-linear and parabolic partial differential equation latent force models. Further, we show that a neural operator approach can scale our model to thousands of instances, enabling fast, distributed computation. We demonstrate the efficacy and flexibility of our framework by achieving competitive performance on several tasks where the kernels are of varying degrees of tractability.
    Task Affinity with Maximum Bipartite Matching in Few-Shot Learning. (arXiv:2110.02399v2 [cs.LG] UPDATED)
    (0 min) We propose an asymmetric affinity score for representing the complexity of utilizing the knowledge of one task for learning another one. Our method is based on the maximum bipartite matching algorithm and utilizes the Fisher Information matrix. We provide theoretical analyses demonstrating that the proposed score is mathematically well-defined, and subsequently use the affinity score to propose a novel algorithm for the few-shot learning problem. In particular, using this score, we find relevant training data labels to the test data and leverage the discovered relevant data for episodically fine-tuning a few-shot model. Results on various few-shot benchmark datasets demonstrate the efficacy of the proposed approach by improving the classification accuracy over the state-of-the-art methods even when using smaller models.
    Modular Deep Reinforcement Learning for Continuous Motion Planning with Temporal Logic. (arXiv:2102.12855v7 [cs.LG] UPDATED)
    (0 min) This paper investigates the motion planning of autonomous dynamical systems modeled by Markov decision processes (MDP) with unknown transition probabilities over continuous state and action spaces. Linear temporal logic (LTL) is used to specify high-level tasks over infinite horizon, which can be converted into a limit deterministic generalized B\"uchi automaton (LDGBA) with several accepting sets. The novelty is to design an embedded product MDP (EP-MDP) between the LDGBA and the MDP by incorporating a synchronous tracking-frontier function to record unvisited accepting sets of the automaton, and to facilitate the satisfaction of the accepting conditions. The proposed LDGBA-based reward shaping and discounting schemes for the model-free reinforcement learning (RL) only depend on the EP-MDP states and can overcome the issues of sparse rewards. Rigorous analysis shows that any RL method that optimizes the expected discounted return is guaranteed to find an optimal policy whose traces maximize the satisfaction probability. A modular deep deterministic policy gradient (DDPG) is then developed to generate such policies over continuous state and action spaces. The performance of our framework is evaluated via an array of OpenAI gym environments.
    Partial Graph Reasoning for Neural Network Regularization. (arXiv:2106.01805v2 [cs.LG] UPDATED)
    (0 min) Regularizers help deep neural networks prevent feature co-adaptations. Dropout, as a commonly used regularization technique, stochastically disables neuron activations during network optimization. However, such complete feature disposal can affect the feature representation and network understanding. Toward better descriptions of latent representations, we present DropGraph that learns a regularization function by constructing a stand-alone graph from the backbone features. DropGraph first samples stochastic spatial feature vectors and then incorporates graph reasoning methods to generate feature map distortions. This add-on graph regularizes the network during training and can be completely skipped during inference. We provide intuitions on the linkage between graph reasoning and Dropout with further discussions on how partial graph reasoning method reduces feature correlations. To this end, we extensively study the modeling of graph vertex dependencies and the utilization of the graph for distorting backbone feature maps. DropGraph was validated on 4 tasks with a total of 8 different datasets. The experimental results show that our method outperforms other state-of-the-art regularizers while leaving the base model structure unmodified during inference.
    An Optimal Control Framework for Joint-channel Parallel MRI Reconstruction without Coil Sensitivities. (arXiv:2109.09738v2 [eess.IV] UPDATED)
    (0 min) Goal: This work aims at developing a novel calibration-free fast parallel MRI (pMRI) reconstruction method incorporate with discrete-time optimal control framework. The reconstruction model is designed to learn a regularization that combines channels and extracts features by leveraging the information sharing among channels of multi-coil images. We propose to recover both magnitude and phase information by taking advantage of structured convolutional networks in image and Fourier spaces. Methods: We develop a novel variational model with a learnable objective function that integrates an adaptive multi-coil image combination operator and effective image regularization in the image and Fourier spaces. We cast the reconstruction network as a structured discrete-time optimal control system, resulting in an optimal control formulation of parameter training where the parameters of the objective function play the role of control variables. We demonstrate that the Lagrangian method for solving the control problem is equivalent to back-propagation, ensuring the local convergence of the training algorithm. Results: We conduct a large number of numerical experiments of the proposed method with comparisons to several state-of-the-art pMRI reconstruction networks on real pMRI datasets. The numerical results demonstrate the promising performance of the proposed method evidently. Conclusion: We conduct a large number of numerical experiments of the proposed method with comparisons to several state-of-the-art pMRI reconstruction networks on real pMRI datasets. The numerical results demonstrate the promising performance of the proposed method evidently. Significance: By learning multi-coil image combination operator and performing regularizations in both image domain and k-space domain, the proposed method achieves a highly efficient image reconstruction network for pMRI.
    A Lyapunov-Based Methodology for Constrained Optimization with Bandit Feedback. (arXiv:2106.05165v3 [cs.LG] UPDATED)
    (0 min) In a wide variety of applications including online advertising, contractual hiring, and wireless scheduling, the controller is constrained by a stringent budget constraint on the available resources, which are consumed in a random amount by each action, and a stochastic feasibility constraint that may impose important operational limitations on decision-making. In this work, we consider a general model to address such problems, where each action returns a random reward, cost, and penalty from an unknown joint distribution, and the decision-maker aims to maximize the total reward under a budget constraint $B$ on the total cost and a stochastic constraint on the time-average penalty. We propose a novel low-complexity algorithm based on Lyapunov optimization methodology, named ${\tt LyOn}$, and prove that for $K$ arms it achieves $O(\sqrt{K B\log B})$ regret and zero constraint-violation when $B$ is sufficiently large. The low computational cost and sharp performance bounds of ${\tt LyOn}$ suggest that Lyapunov-based algorithm design methodology can be effective in solving constrained bandit optimization problems.
    Neuronal Correlation: a Central Concept in Neural Network. (arXiv:2201.09069v1 [cs.LG])
    (2 min) This paper proposes to study neural networks through neuronal correlation, a statistical measure of correlated neuronal activity on the penultimate layer. We show that neuronal correlation can be efficiently estimated via weight matrix, can be effectively enforced through layer structure, and is a strong indicator of generalisation ability of the network. More importantly, we show that neuronal correlation significantly impacts on the accuracy of entropy estimation in high-dimensional hidden spaces. While previous estimation methods may be subject to significant inaccuracy due to implicit assumption on neuronal independence, we present a novel computational method to have an efficient and authentic computation of entropy, by taking into consideration the neuronal correlation. In doing so, we install neuronal correlation as a central concept of neural network.
    Mechanical Search on Shelves using a Novel "Bluction" Tool. (arXiv:2201.08968v1 [cs.RO])
    (2 min) Shelves are common in homes, warehouses, and commercial settings due to their storage efficiency. However, this efficiency comes at the cost of reduced visibility and accessibility. When looking from a side (lateral) view of a shelf, most objects will be fully occluded, resulting in a constrained lateral-access mechanical search problem. To address this problem, we introduce: (1) a novel bluction tool, which combines a thin pushing blade and suction cup gripper, (2) an improved LAX-RAY simulation pipeline and perception model that combines ray-casting with 2D Minkowski sums to efficiently generate target occupancy distributions, and (3) a novel SLAX-RAY search policy, which optimally reduces target object distribution support area using the bluction tool. Experimental data from 2000 simulated shelf trials and 18 trials with a physical Fetch robot equipped with the bluction tool suggest that using suction grasping actions improves the success rate over the highest performing push-only policy by 26% in simulation and 67% in physical environments.
    Environment Generation for Zero-Shot Compositional Reinforcement Learning. (arXiv:2201.08896v1 [cs.LG])
    (2 min) Many real-world problems are compositional - solving them requires completing interdependent sub-tasks, either in series or in parallel, that can be represented as a dependency graph. Deep reinforcement learning (RL) agents often struggle to learn such complex tasks due to the long time horizons and sparse rewards. To address this problem, we present Compositional Design of Environments (CoDE), which trains a Generator agent to automatically build a series of compositional tasks tailored to the RL agent's current skill level. This automatic curriculum not only enables the agent to learn more complex tasks than it could have otherwise, but also selects tasks where the agent's performance is weak, enhancing its robustness and ability to generalize zero-shot to unseen tasks at test-time. We analyze why current environment generation techniques are insufficient for the problem of generating compositional tasks, and propose a new algorithm that addresses these issues. Our results assess learning and generalization across multiple compositional tasks, including the real-world problem of learning to navigate and interact with web pages. We learn to generate environments composed of multiple pages or rooms, and train RL agents capable of completing wide-range of complex tasks in those environments. We contribute two new benchmark frameworks for generating compositional tasks, compositional MiniGrid and gMiniWoB for web navigation.CoDE yields 4x higher success rate than the strongest baseline, and demonstrates strong performance of real websites learned on 3500 primitive tasks.
    Survival Prediction of Children Undergoing Hematopoietic Stem Cell Transplantation Using Different Machine Learning Classifiers by Performing Chi-squared Test and Hyper-parameter Optimization: A Retrospective Analysis. (arXiv:2201.08987v1 [cs.LG])
    (2 min) Bone Marrow Transplant, a gradational rescue for a wide range of disorders emanating from the bone marrow, is an efficacious surgical treatment. Several risk factors, such as post-transplant illnesses, new malignancies, and even organ damage, can impair long-term survival. Therefore, technologies like Machine Learning are deployed for investigating the survival prediction of BMT receivers along with the influences that limit their resilience. In this study, an efficient survival classification model is presented in a comprehensive manner, incorporating the Chi-squared feature selection method to address the dimensionality problem and Hyper Parameter Optimization (HPO) to increase accuracy. A synthetic dataset is generated by imputing the missing values, transforming the data using dummy variable encoding, and compressing the dataset from 59 features to the 11 most correlated features using Chi-squared feature selection. The dataset was split into train and test sets at a ratio of 80:20, and the hyperparameters were optimized using Grid Search Cross-Validation. Several supervised ML methods were trained in this regard, like Decision Tree, Random Forest, Logistic Regression, K-Nearest Neighbors, Gradient Boosting Classifier, Ada Boost, and XG Boost. The simulations have been performed for both the default and optimized hyperparameters by using the original and reduced synthetic dataset. After ranking the features using the Chi-squared test, it was observed that the top 11 features with HPO, resulted in the same accuracy of prediction (94.73%) as the entire dataset with default parameters. Moreover, this approach requires less time and resources for predicting the survivability of children undergoing BMT. Hence, the proposed approach may aid in the development of a computer-aided diagnostic system with satisfactory accuracy and minimal computation time by utilizing medical data records.
    On the in vivo recognition of kidney stones using machine learning. (arXiv:2201.08865v1 [eess.IV])
    (2 min) Determining the type of kidney stones allows urologists to prescribe a treatment to avoid recurrence of renal lithiasis. An automated in-vivo image-based classification method would be an important step towards an immediate identification of the kidney stone type required as a first phase of the diagnosis. In the literature it was shown on ex-vivo data (i.e., in very controlled scene and image acquisition conditions) that an automated kidney stone classification is indeed feasible. This pilot study compares the kidney stone recognition performances of six shallow machine learning methods and three deep-learning architectures which were tested with in-vivo images of the four most frequent urinary calculi types acquired with an endoscope during standard ureteroscopies. This contribution details the database construction and the design of the tested kidney stones classifiers. Even if the best results were obtained by the Inception v3 architecture (weighted precision, recall and F1-score of 0.97, 0.98 and 0.97, respectively), it is also shown that choosing an appropriate colour space and texture features allows a shallow machine learning method to approach closely the performances of the most promising deep-learning methods (the XGBoost classifier led to weighted precision, recall and F1-score values of 0.96). This paper is the first one that explores the most discriminant features to be extracted from images acquired during ureteroscopies.
    Online Attentive Kernel-Based Temporal Difference Learning. (arXiv:2201.09065v1 [cs.LG])
    (2 min) With rising uncertainty in the real world, online Reinforcement Learning (RL) has been receiving increasing attention due to its fast learning capability and improving data efficiency. However, online RL often suffers from complex Value Function Approximation (VFA) and catastrophic interference, creating difficulty for the deep neural network to be applied to an online RL algorithm in a fully online setting. Therefore, a simpler and more adaptive approach is introduced to evaluate value function with the kernel-based model. Sparse representations are superior at handling interference, indicating that competitive sparse representations should be learnable, non-prior, non-truncated and explicit when compared with current sparse representation methods. Moreover, in learning sparse representations, attention mechanisms are utilized to represent the degree of sparsification, and a smooth attentive function is introduced into the kernel-based VFA. In this paper, we propose an Online Attentive Kernel-Based Temporal Difference (OAKTD) algorithm using two-timescale optimization and provide convergence analysis of our proposed algorithm. Experimental evaluations showed that OAKTD outperformed several Online Kernel-based Temporal Difference (OKTD) learning algorithms in addition to the Temporal Difference (TD) learning algorithm with Tile Coding on public Mountain Car, Acrobot, CartPole and Puddle World tasks.
    Nearest Class-Center Simplification through Intermediate Layers. (arXiv:2201.08924v1 [cs.LG])
    (2 min) Recent advances in theoretical Deep Learning have introduced geometric properties that occur during training, past the Interpolation Threshold -- where the training error reaches zero. We inquire into the phenomena coined Neural Collapse in the intermediate layers of the networks, and emphasize the innerworkings of Nearest Class-Center Mismatch inside the deepnet. We further show that these processes occur both in vision and language model architectures. Lastly, we propose a Stochastic Variability-Simplification Loss (SVSL) that encourages better geometrical features in intermediate layers, and improves both train metrics and generalization.
    Escaping the Impossibility of Fairness: From Formal to Substantive Algorithmic Fairness. (arXiv:2107.04642v4 [cs.CY] UPDATED)
    (0 min) Algorithmic fairness provides novel methods for promoting equitable public policy using machine learning. Yet the narrow formulation of algorithmic fairness often provides cover for algorithms that exacerbate oppression, leading critics to call for a more justice-oriented approach. This article takes up these calls and proposes a method for operationalizing a social justice orientation into algorithmic fairness. First, I argue that algorithmic fairness suffers from a significant methodological limitation: it restricts analysis to isolated decision points. Because algorithmic fairness relies on this narrow scope of analysis, it yields a reform strategy that is fundamentally constrained by the "impossibility of fairness" (an incompatibility between mathematical definitions of fairness). Second, in light of these flaws, I draw on theories of substantive equality from law and philosophy to propose an alternative methodology: "substantive algorithmic fairness." Because substantive algorithmic fairness takes a more expansive scope to fairness, it suggests reform strategies that escape from the impossibility of fairness. These strategies provide a rigorous guide for employing algorithms to alleviate social injustice. In sum, substantive algorithmic fairness presents a new direction for the field of algorithmic fairness: away from formal mathematical models of "fairness" and toward substantive evaluations of how algorithms can (and cannot) promote justice.
    Semi-Empirical Objective Functions for MCMC Proposal Optimization. (arXiv:2106.02104v3 [cs.LG] UPDATED)
    (0 min) Current objective functions used for training neural MCMC proposal distributions implicitly rely on architectural restrictions to yield sensible optimization results, which hampers the development of highly expressive neural MCMC proposal architectures. In this work, we introduce and demonstrate a semi-empirical procedure for determining approximate objective functions suitable for optimizing arbitrarily parameterized proposal distributions in MCMC methods. Our proposed Ab Initio objective functions consist of the weighted combination of functions following constraints on their global optima and transformation invariances that we argue should be upheld by general measures of MCMC efficiency for use in proposal optimization. Our experimental results demonstrate that Ab Initio objective functions maintain favorable performance and preferable optimization behavior compared to existing objective functions for neural MCMC optimization. We find that Ab Initio objective functions are sufficiently robust to enable the confident optimization of neural proposal distributions parameterized by deep generative networks extending beyond the regimes of traditional MCMC schemes
    Recurrent Neural Network-based Internal Model Control design for stable nonlinear systems. (arXiv:2108.04585v2 [cs.LG] UPDATED)
    (0 min) Owing to their superior modeling capabilities, gated Recurrent Neural Networks, such as Gated Recurrent Units (GRUs) and Long Short-Term Memory networks (LSTMs), have become popular tools for learning dynamical systems. This paper aims to discuss how these networks can be adopted for the synthesis of Internal Model Control (IMC) architectures. To this end, first a gated recurrent network is used to learn a model of the unknown input-output stable plant. Then, a controller gated recurrent network is trained to approximate the model inverse. The stability of these networks, ensured by means of a suitable training procedure, allows to guarantee the input-output closed-loop stability. The proposed scheme is able to cope with the saturation of the control variables, and can be deployed on low-power embedded controllers, as it requires limited online computations. The approach is then tested on the Quadruple Tank benchmark system and compared to alternative control laws, resulting in remarkable closed-loop performances.
    Perturbated Gradients Updating within Unit Space for Deep Learning. (arXiv:2110.00199v2 [cs.LG] UPDATED)
    (0 min) In deep learning, optimization plays a vital role. By focusing on image classification, this work investigates the pros and cons of the widely used optimizers, and proposes a new optimizer: Perturbated Unit Gradient Descent (PUGD) algorithm with extending normalized gradient operation in tensor within perturbation to update in unit space. Via a set of experiments and analyses, we show that PUGD is locally bounded updating, which means the updating from time to time is controlled. On the other hand, PUGD can push models to a flat minimum, where the error remains approximately constant, not only because of the nature of avoiding stationary points in gradient normalization but also by scanning sharpness in the unit ball. From a series of rigorous experiments, PUGD helps models to gain a state-of-the-art Top-1 accuracy in Tiny ImageNet and competitive performances in CIFAR- {10, 100}. We open-source our code at link: https://github.com/hanktseng131415go/PUGD.
    VCG Mechanism Design with Unknown Agent Values under Stochastic Bandit Feedback. (arXiv:2004.08924v4 [stat.ML] UPDATED)
    (0 min) We study a multi-round welfare-maximising mechanism design problem in instances where agents do not know their values. On each round, a mechanism first assigns an allocation each to a set of agents and charges them a price; at the end of the round, the agents provide (stochastic) feedback to the mechanism for the allocation they received. This setting is motivated by applications in cloud markets and online advertising where an agent may know her value for an allocation only after experiencing it. Therefore, the mechanism needs to explore different allocations for each agent so that it can learn their values, while simultaneously attempting to find the socially optimal set of allocations. Our focus is on truthful and individually rational mechanisms which imitate the classical VCG mechanism in the long run. To that end, we first define three notions of regret for the welfare, the individual utilities of each agent and that of the mechanism. We show that these three terms are interdependent via an $\Omega(T^{\frac{2}{3}})$ lower bound for the maximum of these three terms after $T$ rounds of allocations, and describe an algorithm which essentially achieves this rate. Our framework also provides flexibility to control the pricing scheme so as to trade-off between the agent and seller regrets. Next, we define asymptotic variants for the truthfulness and individual rationality requirements and provide asymptotic rates to quantify the degree to which both properties are satisfied by the proposed algorithm.
    Optimal Transport for Unsupervised Denoising Learning. (arXiv:2108.02574v3 [eess.IV] UPDATED)
    (0 min) Recently, much progress has been made in unsupervised denoising learning. However, existing methods more or less rely on some assumptions on the signal and/or degradation model, which limits their practical performance. How to construct an optimal criterion for unsupervised denoising learning without any prior knowledge on the degradation model is still an open question. Toward answering this question, this work proposes a criterion for unsupervised denoising learning based on the optimal transport theory. This criterion has favorable properties, e.g., approximately maximal preservation of the information of the signal, whilst achieving perceptual reconstruction. Furthermore, though a relaxed unconstrained formulation is used in practical implementation, we prove that the relaxed formulation in theory has the same solution as the original constrained formulation. Experiments on synthetic and real-world data, including realistic photographic, microscopy, depth, and raw depth images, demonstrate that the proposed method even compares favorably with supervised methods, e.g., approaching the PSNR of supervised methods while having better perceptual quality. Particularly, for spatially correlated noise and realistic microscopy images, the proposed method not only achieves better perceptual quality but also has higher PSNR than supervised methods. Besides, it shows remarkable superiority in harsh practical conditions with complex noise, e.g., raw depth images. Code is available at https://github.com/wangweiSJTU/OTUR.
    Generative and Contrastive Self-Supervised Learning for Graph Anomaly Detection. (arXiv:2108.09896v2 [cs.LG] UPDATED)
    (0 min) Anomaly detection from graph data has drawn much attention due to its practical significance in many critical applications including cybersecurity, finance, and social networks. Existing data mining and machine learning methods are either shallow methods that could not effectively capture the complex interdependency of graph data or graph autoencoder methods that could not fully exploit the contextual information as supervision signals for effective anomaly detection. To overcome these challenges, in this paper, we propose a novel method, Self-Supervised Learning for Graph Anomaly Detection (SL-GAD). Our method constructs different contextual subgraphs (views) based on a target node and employs two modules, generative attribute regression and multi-view contrastive learning for anomaly detection. While the generative attribute regression module allows us to capture the anomalies in the attribute space, the multi-view contrastive learning module can exploit richer structure information from multiple subgraphs, thus abling to capture the anomalies in the structure space, mixing of structure, and attribute information. We conduct extensive experiments on six benchmark datasets and the results demonstrate that our method outperforms state-of-the-art methods by a large margin.
    Compression, Transduction, and Creation: A Unified Framework for Evaluating Natural Language Generation. (arXiv:2109.06379v2 [cs.CL] UPDATED)
    (0 min) Natural language generation (NLG) spans a broad range of tasks, each of which serves for specific objectives and desires different properties of generated text. The complexity makes automatic evaluation of NLG particularly challenging. Previous work has typically focused on a single task and developed individual evaluation metrics based on specific intuitions. In this paper, we propose a unifying perspective that facilitates the design of metrics for a wide range of language generation tasks and quality aspects. Based on the nature of information change from input to output, we classify NLG tasks into compression (e.g., summarization), transduction (e.g., text rewriting), and creation (e.g., dialog). The information alignment, or overlap, between input, context, and output text plays a common central role in characterizing the generation. Using the uniform concept of information alignment, we develop a family of interpretable metrics for various NLG tasks and aspects, often without need of gold reference data. To operationalize the metrics, we train self-supervised models to approximate information alignment as a prediction task. Experiments show the uniformly designed metrics achieve stronger or comparable correlations with human judgement compared to state-of-the-art metrics in each of diverse tasks, including text summarization, style transfer, and knowledge-grounded dialog. With information alignment as the intermediate representation, we deliver a composable library for easy NLG evaluation and future metric design.
    Scatter Correction in X-ray CT by Physics-Inspired Deep Learning. (arXiv:2103.11509v2 [physics.med-ph] UPDATED)
    (0 min) Scatter due to interaction of photons with the imaged object is a fundamental problem in X-ray Computed Tomography (CT). It manifests as various artifacts in the reconstruction, making its abatement or correction critical for image quality. Despite success in specific settings, hardware-based methods require modification in the hardware, or increase in the scan time or dose. This accounts for the great interest in software-based methods, including Monte-Carlo based scatter estimation, analytical-numerical, and kernel-based methods, with data-driven learning-based approaches demonstrated recently. In this work, two novel physics-inspired deep-learning-based methods, PhILSCAT and OV-PhILSCAT, are proposed. The methods estimate and correct for the scatter in the acquired projection measurements. Different from previous works, they incorporate both an initial reconstruction of the object of interest and the scatter-corrupted measurements related to it, and use a deep neural network architecture and cost function, both specifically tailored to the problem. Numerical experiments with data generated by Monte-Carlo simulations of the imaging of phantoms reveal consistent improvement over a recent purely projection-domain deep neural network scatter correction method.
    Learning Lipschitz Feedback Policies from Expert Demonstrations: Closed-Loop Guarantees, Generalization and Robustness. (arXiv:2103.16629v2 [cs.LG] UPDATED)
    (0 min) In this work, we propose a framework to learn feedback control policies with guarantees on closed-loop generalization and adversarial robustness. These policies are learned directly from expert demonstrations, contained in a dataset of state-control input pairs, without any prior knowledge of the task and system model. We use a Lipschitz-constrained loss minimization scheme to learn feedback policies with certified closed-loop robustness, wherein the Lipschitz constraint serves as a mechanism to tune the generalization performance and robustness to adversarial disturbances. Our analysis exploits the Lipschitz property to obtain closed-loop guarantees on generalization and robustness of the learned policies. In particular, we derive a finite sample bound on the policy learning error and establish robust closed-loop stability under the learned control policy. We also derive bounds on the closed-loop regret with respect to the expert policy and the deterioration of closed-loop performance under bounded (adversarial) disturbances to the state measurements. Numerical results validate our analysis and demonstrate the effectiveness of our robust feedback policy learning framework. Finally, our results suggest the existence of a potential tradeoff between nominal closed-loop performance and adversarial robustness, and that improvements in nominal closed-loop performance can only be made at the expense of robustness to adversarial perturbations.
    Learning Space Partitions for Path Planning. (arXiv:2106.10544v4 [cs.AI] UPDATED)
    (0 min) Path planning, the problem of efficiently discovering high-reward trajectories, often requires optimizing a high-dimensional and multimodal reward function. Popular approaches like CEM and CMA-ES greedily focus on promising regions of the search space and may get trapped in local maxima. DOO and VOOT balance exploration and exploitation, but use space partitioning strategies independent of the reward function to be optimized. Recently, LaMCTS empirically learns to partition the search space in a reward-sensitive manner for black-box optimization. In this paper, we develop a novel formal regret analysis for when and why such an adaptive region partitioning scheme works. We also propose a new path planning method LaP3 which improves the function value estimation within each sub-region, and uses a latent representation of the search space. Empirically, LaP3 outperforms existing path planning methods in 2D navigation tasks, especially in the presence of difficult-to-escape local optima, and shows benefits when plugged into the planning components of model-based RL such as PETS. These gains transfer to highly multimodal real-world tasks, where we outperform strong baselines in compiler phase ordering by up to 39% on average across 9 tasks, and in molecular design by up to 0.4 on properties on a 0-1 scale. Code is available at https://github.com/yangkevin2/neurips2021-lap3.
    From the Greene--Wu Convolution to Gradient Estimation over Riemannian Manifolds. (arXiv:2108.07406v4 [cs.LG] UPDATED)
    (0 min) Over a complete Riemannian manifold of finite dimension, Greene and Wu introduced a convolution, known as Greene-Wu (GW) convolution. In this paper, we study properties of the GW convolution and apply it to non-Euclidean machine learning problems. In particular, we derive a new formula for how the curvature of the space would affect the curvature of the function through the GW convolution. Also, following the study of the GW convolution, a new method for gradient estimation over Riemannian manifolds is introduced.
    Prediction-Free, Real-Time Flexible Control of Tidal Lagoons through Proximal Policy Optimisation: A Case Study for the Swansea Lagoon. (arXiv:2106.10360v3 [cs.LG] UPDATED)
    (0 min) Tidal Range Structures (TRS) have been considered for large-scale electricity generation for their potential ability to produce reasonably predictable energy without the emission of greenhouse gases. Once the main forcing components for driving the tides have deterministic dynamics, the available energy in a given TRS has been estimated, through analytical and numerical optimisation routines, as a mostly predictable event. This constraint imposes state-of-art flexible operation methods to rely on tidal predictions to infer best operational strategies for TRS, with the additional cost of requiring to run optimisation routines for every new tide. In this paper, a Deep Reinforcement Learning approach (Proximal Policy Optimisation through Unity ML-Agents) is introduced to perform automatic operation of TRS. For validation, the performance of the proposed method is compared with six different operation optimisation approaches devised from the literature, utilising the Swansea Bay Tidal Lagoon as a case study. We show that our approach is successful in maximising energy generation through an optimised operational policy of turbines and sluices, yielding competitive results with state-of-art optimisation strategies, with the clear advantages of requiring training once and performing real-time automatic control of TRS with measured ocean data only.
    Reinforcement Learning for Personalized Drug Discovery and Design for Complex Diseases: A Systems Pharmacology Perspective. (arXiv:2201.08894v1 [q-bio.BM])
    (2 min) Many multi-genic systematic diseases such as Alzheimer's disease and majority of cancers do not have effective treatments yet. Systems pharmacology is a potentially effective approach to designing personalized therapies for untreatable complexed diseases. In this article, we review the potential of reinforcement learning in systems pharmacology-oriented drug discovery and design. In spite of successful application of advanced reinforcement learning techniques to target-based drug discovery, new reinforcement learning techniques are needed to boost generalizability and transferability of reinforcement learning in partially observed and changing environments, optimize multi-objective reward functions for system-level molecular phenotype readouts and generalize predictive models for out-of-distribution data. A synergistic integration of reinforcement learning with other machine learning techniques and related fields such as biophysics and quantum computing is needed to achieve the ultimate goal of systems pharmacology-oriented de novo drug design for personalized medicine.
    Variational Autoencoder based Metamodeling for Multi-Objective Topology Optimization of Electrical Machines. (arXiv:2201.08877v1 [cs.LG])
    (2 min) Conventional magneto-static finite element analysis of electrical machine models is time-consuming and computationally expensive. Since each machine topology has a distinct set of parameters, design optimization is commonly performed independently. This paper presents a novel method for predicting Key Performance Indicators (KPIs) of differently parameterized electrical machine topologies at the same time by mapping a high dimensional integrated design parameters in a lower dimensional latent space using a variational autoencoder. After training, via a latent space, the decoder and multi-layer neural network will function as meta-models for sampling new designs and predicting associated KPIs, respectively. This enables parameter-based concurrent multi-topology optimization.
    HiSTGNN: Hierarchical Spatio-temporal Graph Neural Networks for Weather Forecasting. (arXiv:2201.09101v1 [cs.LG])
    (2 min) Weather Forecasting is an attractive challengeable task due to its influence on human life and complexity in atmospheric motion. Supported by massive historical observed time series data, the task is suitable for data-driven approaches, especially deep neural networks. Recently, the Graph Neural Networks (GNNs) based methods have achieved excellent performance for spatio-temporal forecasting. However, the canonical GNNs-based methods only individually model the local graph of meteorological variables per station or the global graph of whole stations, lacking information interaction between meteorological variables in different stations. In this paper, we propose a novel Hierarchical Spatio-Temporal Graph Neural Network (HiSTGNN) to model cross-regional spatio-temporal correlations among meteorological variables in multiple stations. An adaptive graph learning layer and spatial graph convolution are employed to construct self-learning graph and study hidden dependency among nodes of variable-level and station-level graph. For capturing temporal pattern, the dilated inception as the backbone of gate temporal convolution is designed to model long and various meteorological trends. Moreover, a dynamic interaction learning is proposed to build bidirectional information passing in hierarchical graph. Experimental results on three real-world meteorological datasets demonstrate the superior performance of HiSTGNN beyond 7 baselines and it reduces the errors by 4.2% to 11.6% especially compared to state-of-the-art weather forecasting method.
    Speech Recovery for Real-World Self-powered Intermittent Devices. (arXiv:2106.05229v2 [cs.SD] UPDATED)
    (0 min) The incompleteness of speech inputs severely degrades the performance of all the related speech signal processing applications. Although many researches have been proposed to address this issue, they controlled the data missing conditions by simulation with self-defined masking lengths or sizes. Besides, the masking definitions are different among all these experimental settings. This paper presents a novel intermittent speech recovery (ISR) system for real-world self-powered intermittent devices. Three contributive stages: interpolation, enhancement, and combination are applied to the ISR system for speech reconstruction. The experimental results show that our recovery system increases speech quality by up to 591.7%, while increasing speech intelligibility by up to 80.5%. Most importantly, the proposed ISR system improves the WER scores by up to 52.6%. The promising results not only confirm the effectiveness of the reconstruction but also encourage the utilization of these battery-free wearable/IoT devices.
    Efficient Representation for Electric Vehicle Charging Station Operations using Reinforcement Learning. (arXiv:2108.03236v2 [cs.LG] UPDATED)
    (0 min) Effectively operating electrical vehicle charging station (EVCS) is crucial for enabling the rapid transition of electrified transportation. To solve this problem using reinforcement learning (RL), the dimension of state/action spaces scales with the number of EVs and is thus very large and time-varying. This dimensionality issue affects the efficiency and convergence properties of generic RL algorithms. We develop aggregation schemes that are based on the emergency of EV charging, namely the laxity value. A least-laxity first (LLF) rule is adopted to consider only the total charging power of the EVCS which ensures the feasibility of individual EV schedules. In addition, we propose an equivalent state aggregation that can guarantee to attain the same optimal policy. Based on the proposed representation, policy gradient method is used to find the best parameters for the linear Gaussian policy . Numerical results have validated the performance improvement of the proposed representation approaches in attaining higher rewards and more effective policies as compared to existing approximation based approach.
    CTRMs: Learning to Construct Cooperative Timed Roadmaps for Multi-agent Path Planning in Continuous Spaces. (arXiv:2201.09467v1 [cs.MA])
    (0 min) Multi-agent path planning (MAPP) in continuous spaces is a challenging problem with significant practical importance. One promising approach is to first construct graphs approximating the spaces, called roadmaps, and then apply multi-agent pathfinding (MAPF) algorithms to derive a set of conflict-free paths. While conventional studies have utilized roadmap construction methods developed for single-agent planning, it remains largely unexplored how we can construct roadmaps that work effectively for multiple agents. To this end, we propose a novel concept of roadmaps called cooperative timed roadmaps (CTRMs). CTRMs enable each agent to focus on its important locations around potential solution paths in a way that considers the behavior of other agents to avoid inter-agent collisions (i.e., "cooperative"), while being augmented in the time direction to make it easy to derive a "timed" solution path. To construct CTRMs, we developed a machine-learning approach that learns a generative model from a collection of relevant problem instances and plausible solutions and then uses the learned model to sample the vertices of CTRMs for new, previously unseen problem instances. Our empirical evaluation revealed that the use of CTRMs significantly reduced the planning effort with acceptable overheads while maintaining a success rate and solution quality comparable to conventional roadmap construction approaches.
    Self-Supervised Multisensor Change Detection. (arXiv:2103.05102v3 [cs.CV] UPDATED)
    (0 min) Most change detection methods assume that pre-change and post-change images are acquired by the same sensor. However, in many real-life scenarios, e.g., natural disaster, it is more practical to use the latest available images before and after the occurrence of incidence, which may be acquired using different sensors. In particular, we are interested in the combination of the images acquired by optical and Synthetic Aperture Radar (SAR) sensors. SAR images appear vastly different from the optical images even when capturing the same scene. Adding to this, change detection methods are often constrained to use only target image-pair, no labeled data, and no additional unlabeled data. Such constraints limit the scope of traditional supervised machine learning and unsupervised generative approaches for multi-sensor change detection. Recent rapid development of self-supervised learning methods has shown that some of them can even work with only few images. Motivated by this, in this work we propose a method for multi-sensor change detection using only the unlabeled target bi-temporal images that are used for training a network in self-supervised fashion by using deep clustering and contrastive learning. The proposed method is evaluated on four multi-modal bi-temporal scenes showing change and the benefits of our self-supervised approach are demonstrated.
    Diverse Gaussian Noise Consistency Regularization for Robustness and Uncertainty Calibration under Noise Domain Shifts. (arXiv:2104.01231v3 [cs.LG] UPDATED)
    (0 min) Deep neural networks achieve high prediction accuracy when the train and test distributions coincide. In practice though, various types of corruptions occur which deviate from this setup and cause severe performance degradations. Few methods have been proposed to address generalization in the presence of unforeseen domain shifts. In particular, digital noise corruptions arise commonly in practice during the image acquisition stage and present a significant challenge for current robustness approaches. In this paper, we propose a diverse Gaussian noise consistency regularization method for improving robustness of image classifiers under a variety of noise corruptions while still maintaining high clean accuracy. We derive bounds to motivate our Gaussian noise consistency regularization using a local loss landscape analysis. We show that this simple approach improves robustness against various unforeseen noise corruptions over standard and adversarial training and other strong baselines. Furthermore, when combined with diverse data augmentation techniques we empirically show this type of consistency regularization further improves robustness and uncertainty calibration for common corruptions upon the state-of-the-art for several image classification benchmarks.
    Concave Utility Reinforcement Learning: the Mean-Field Game Viewpoint. (arXiv:2106.03787v3 [cs.LG] UPDATED)
    (0 min) Concave Utility Reinforcement Learning (CURL) extends RL from linear to concave utilities in the occupancy measure induced by the agent's policy. This encompasses not only RL but also imitation learning and exploration, among others. Yet, this more general paradigm invalidates the classical Bellman equations, and calls for new algorithms. Mean-field Games (MFGs) are a continuous approximation of many-agent RL. They consider the limit case of a continuous distribution of identical agents, anonymous with symmetric interests, and reduce the problem to the study of a single representative agent in interaction with the full population. Our core contribution consists in showing that CURL is a subclass of MFGs. We think this important to bridge together both communities. It also allows to shed light on aspects of both fields: we show the equivalence between concavity in CURL and monotonicity in the associated MFG, between optimality conditions in CURL and Nash equilibrium in MFG, or that Fictitious Play (FP) for this class of MFGs is simply Frank-Wolfe, bringing the first convergence rate for discrete-time FP for MFGs. We also experimentally demonstrate that, using algorithms recently introduced for solving MFGs, we can address the CURL problem more efficiently.
    Deep Learning on Attributed Sequences. (arXiv:2201.09199v1 [cs.LG])
    (0 min) Recent research in feature learning has been extended to sequence data, where each instance consists of a sequence of heterogeneous items with a variable length. However, in many real-world applications, the data exists in the form of attributed sequences, which is composed of a set of fixed-size attributes and variable-length sequences with dependencies between them. In the attributed sequence context, feature learning remains challenging due to the dependencies between sequences and their associated attributes. In this dissertation, we focus on analyzing and building deep learning models for four new problems on attributed sequences. Our extensive experiments on real-world datasets demonstrate that the proposed solutions significantly improve the performance of each task over the state-of-the-art methods on attributed sequences.
    POTHER: Patch-Voted Deep Learning-based Chest X-ray Bias Analysis for COVID-19 Detection. (arXiv:2201.09360v1 [eess.IV])
    (0 min) A critical step in the fight against COVID-19, which continues to have a catastrophic impact on peoples lives, is the effective screening of patients presented in the clinics with severe COVID-19 symptoms. Chest radiography is one of the promising screening approaches. Many studies reported detecting COVID-19 in chest X-rays accurately using deep learning. A serious limitation of many published approaches is insufficient attention paid to explaining decisions made by deep learning models. Using explainable artificial intelligence methods, we demonstrate that model decisions may rely on confounding factors rather than medical pathology. After an analysis of potential confounding factors found on chest X-ray images, we propose a novel method to minimise their negative impact. We show that our proposed method is more robust than previous attempts to counter confounding factors such as ECG leads in chest X-rays that often influence model classification decisions. In addition to being robust, our method achieves results comparable to the state-of-the-art. The source code and pre-trained weights are publicly available (https://github.com/tomek1911/POTHER).
    MISeval: a Metric Library for Medical Image Segmentation Evaluation. (arXiv:2201.09395v1 [cs.CV])
    (0 min) Correct performance assessment is crucial for evaluating modern artificial intelligence algorithms in medicine like deep-learning based medical image segmentation models. However, there is no universal metric library in Python for standardized and reproducible evaluation. Thus, we propose our open-source publicly available Python package MISeval: a metric library for Medical Image Segmentation Evaluation. The implemented metrics can be intuitively used and easily integrated into any performance assessment pipeline. The package utilizes modern CI/CD strategies to ensure functionality and stability. MISeval is available from PyPI (miseval) and GitHub: https://github.com/frankkramer-lab/miseval.
    End-to-End Neural Audio Coding for Real-Time Communications. (arXiv:2201.09429v1 [cs.SD])
    (0 min) Deep-learning based methods have shown their advantages inaudio coding over traditional ones but limited attention hasbeen paid on real-time communications (RTC). This paperproposes the TFNet, an end-to-end neural audio codec withlow latency for RTC. It takes an encoder-temporal filtering-decoder paradigm that seldom being investigated in audiocoding. An interleaved structure is proposed for temporalfiltering to capture both short-term and long-term temporaldependencies. Furthermore, with end-to-end optimization,the TFNet is jointly optimized with speech enhancement andpacket loss concealment, yielding a one-for-all network forthree tasks. Both subjective and objective results demonstratethe efficiency of the proposed TFNet.
    Backdoor Defense with Machine Unlearning. (arXiv:2201.09538v1 [cs.CR])
    (0 min) Backdoor injection attack is an emerging threat to the security of neural networks, however, there still exist limited effective defense methods against the attack. In this paper, we propose BAERASE, a novel method that can erase the backdoor injected into the victim model through machine unlearning. Specifically, BAERASE mainly implements backdoor defense in two key steps. First, trigger pattern recovery is conducted to extract the trigger patterns infected by the victim model. Here, the trigger pattern recovery problem is equivalent to the one of extracting an unknown noise distribution from the victim model, which can be easily resolved by the entropy maximization based generative model. Subsequently, BAERASE leverages these recovered trigger patterns to reverse the backdoor injection procedure and induce the victim model to erase the polluted memories through a newly designed gradient ascent based machine unlearning method. Compared with the previous machine unlearning solutions, the proposed approach gets rid of the reliance on the full access to training data for retraining and shows higher effectiveness on backdoor erasing than existing fine-tuning or pruning methods. Moreover, experiments show that BAERASE can averagely lower the attack success rates of three kinds of state-of-the-art backdoor attacks by 99\% on four benchmark datasets.
    A Bayesian regularization-backpropagation neural network model for peeling computations. (arXiv:2006.16409v3 [cs.CE] UPDATED)
    (0 min) Bayesian regularization-backpropagation neural network (BR-BPNN) model is employed to predict some aspects of the gecko spatula peeling viz. the variation of the maximum normal and tangential pull-off forces and the resultant force angle at detachment with the peeling angle. K-fold cross validation is used to improve the effectiveness of the model. The input data is taken from finite element (FE) peeling results. The neural network is trained with 75% of the FE dataset. The remaining 25% are utilized to predict the peeling behavior. The training performance is evaluated for every change in the number of hidden layer neurons to determine the optimal network structure. The relative error is calculated to draw a clear comparison between predicted and FE results. It is shown that the BR-BPNN model in conjunction with k-fold technique has significant potential to estimate the peeling behavior.
    Federated Unlearning with Knowledge Distillation. (arXiv:2201.09441v1 [cs.LG])
    (0 min) Federated Learning (FL) is designed to protect the data privacy of each client during the training process by transmitting only models instead of the original data. However, the trained model may memorize certain information about the training data. With the recent legislation on right to be forgotten, it is crucially essential for the FL model to possess the ability to forget what it has learned from each client. We propose a novel federated unlearning method to eliminate a client's contribution by subtracting the accumulated historical updates from the model and leveraging the knowledge distillation method to restore the model's performance without using any data from the clients. This method does not have any restrictions on the type of neural networks and does not rely on clients' participation, so it is practical and efficient in the FL system. We further introduce backdoor attacks in the training process to help evaluate the unlearning effect. Experiments on three canonical datasets demonstrate the effectiveness and efficiency of our method.
    Automated machine learning for secure key rate in discrete-modulated continuous-variable quantum key distribution. (arXiv:2201.09419v1 [quant-ph])
    (0 min) Continuous-variable quantum key distribution (CV QKD) with discrete modulation has attracted increasing attention due to its experimental simplicity, lower-cost implementation and compatibility with classical optical communication. Correspondingly, some novel numerical methods have been proposed to analyze the security of these protocols against collective attacks, which promotes key rates over one hundred kilometers of fiber distance. However, numerical methods are limited by their calculation time and resource consumption, for which they cannot play more roles on mobile platforms in quantum networks. To improve this issue, a neural network model predicting key rates in nearly real time has been proposed previously. Here, we go further and show a neural network model combined with Bayesian optimization. This model automatically designs the best architecture of neural network computing key rates in real time. We demonstrate our model with two variants of CV QKD protocols with quaternary modulation. The results show high reliability with secure probability as high as $99.15\%-99.59\%$, considerable tightness and high efficiency with speedup of approximately $10^7$ in both cases. This inspiring model enables the real-time computation of unstructured quantum key distribution protocols' key rate more automatically and efficiently, which has met the growing needs of implementing QKD protocols on moving platforms.
    AC-CovidNet: Attention Guided Contrastive CNN for Recognition of Covid-19 in Chest X-Ray Images. (arXiv:2105.10239v2 [eess.IV] UPDATED)
    (0 min) Covid-19 global pandemic continues to devastate health care systems across the world. At present, the Covid-19 testing is costly and time-consuming. Chest X-Ray (CXR) testing can be a fast, scalable, and non-invasive method. The existing methods suffer due to the limited CXR samples available from Covid-19. Thus, inspired by the limitations of the open-source work in this field, we propose attention guided contrastive CNN architecture (AC-CovidNet) for Covid-19 detection in CXR images. The proposed method learns the robust and discriminative features with the help of contrastive loss. Moreover, the proposed method gives more importance to the infected regions as guided by the attention mechanism. We compute the sensitivity of the proposed method over the publicly available Covid-19 dataset. It is observed that the proposed AC-CovidNet exhibits very promising performance as compared to the existing methods even with limited training data. It can tackle the bottleneck of CXR Covid-19 datasets being faced by the researchers. The code used in this paper is released publicly at \url{https://github.com/shivram1987/AC-CovidNet/}.
    Learning to Complete Code with Sketches. (arXiv:2106.10158v2 [cs.LG] UPDATED)
    (0 min) Code completion is usually cast as a language modelling problem, i.e., continuing an input in a left-to-right fashion. However, in practice, some parts of the completion (e.g., string literals) may be very hard to predict, whereas subsequent parts directly follow from the context. To handle this, we instead consider the scenario of generating code completions with "holes" inserted in places where a model is uncertain. We develop Grammformer, a Transformer-based model that guides code generation by the programming language grammar, and compare it to a variety of more standard sequence models. We train the models on code completion for C# and Python given partial code context. To evaluate models, we consider both ROUGE as well as a new metric RegexAcc that measures success of generating completions matching long outputs with as few holes as possible. In our experiments, Grammformer generates 10-50% more accurate completions compared to traditional generative models and 37-50% longer sketches compared to sketch-generating baselines trained with similar techniques.
    AttentionHTR: Handwritten Text Recognition Based on Attention Encoder-Decoder Networks. (arXiv:2201.09390v1 [cs.CV])
    (0 min) This work proposes an attention-based sequence-to-sequence model for handwritten word recognition and explores transfer learning for data-efficient training of HTR systems. To overcome training data scarcity, this work leverages models pre-trained on scene text images as a starting point towards tailoring the handwriting recognition models. ResNet feature extraction and bidirectional LSTM-based sequence modeling stages together form an encoder. The prediction stage consists of a decoder and a content-based attention mechanism. The effectiveness of the proposed end-to-end HTR system has been empirically evaluated on a novel multi-writer dataset Imgur5K and the IAM dataset. The experimental results evaluate the performance of the HTR framework, further supported by an in-depth analysis of the error cases. Source code and pre-trained models are available at https://github.com/dmitrijsk/AttentionHTR.
    Partition-Based Active Learning for Graph Neural Networks. (arXiv:2201.09391v1 [cs.LG])
    (0 min) We study the problem of semi-supervised learning with Graph Neural Networks (GNNs) in an active learning setup. We propose GraphPart, a novel partition-based active learning approach for GNNs. GraphPart first splits the graph into disjoint partitions and then selects representative nodes within each partition to query. The proposed method is motivated by a novel analysis of the classification error under realistic smoothness assumptions over the graph and the node features. Extensive experiments on multiple benchmark datasets demonstrate that the proposed method outperforms existing active learning methods for GNNs under a wide range of annotation budget constraints. In addition, the proposed method does not introduce additional hyperparameters, which is crucial for model training, especially in the active learning setting where a labeled validation set may not be available.
    Deep Spectrum Cartography: Completing Radio Map Tensors Using Learned Neural Models. (arXiv:2105.00177v2 [eess.SP] UPDATED)
    (0 min) The spectrum cartography (SC) technique constructs multi-domain (e.g., frequency, space, and time) radio frequency (RF) maps from limited measurements, which can be viewed as an ill-posed tensor completion problem. Model-based cartography techniques often rely on handcrafted priors (e.g., sparsity, smoothness and low-rank structures) for the completion task. Such priors may be inadequate to capture the essence of complex wireless environments -- especially when severe shadowing happens. To circumvent such challenges, offline-trained deep neural models of radio maps were considered for SC, as deep neural networks (DNNs) are able to "learn" intricate underlying structures from data. However, such deep learning (DL)-based SC approaches encounter serious challenges in both off-line model learning (training) and completion (generalization), possibly because the latent state space for generating the radio maps is prohibitively large. In this work, an emitter radio map disaggregation-based approach is proposed, under which only individual emitters' radio maps are modeled by DNNs. This way, the learning and generalization challenges can both be substantially alleviated. Using the learned DNNs, a fast nonnegative matrix factorization-based two-stage SC method and a performance-enhanced iterative optimization algorithm are proposed. Theoretical aspects -- such as recoverability of the radio tensor, sample complexity, and noise robustness -- under the proposed framework are characterized, and such theoretical properties have been elusive in the context of DL-based radio tensor completion. Experiments using synthetic and real-data from indoor and heavily shadowed environments are employed to showcase the effectiveness of the proposed methods.
    Learning Deep Graph Representations via Convolutional Neural Networks. (arXiv:2004.02131v2 [cs.LG] UPDATED)
    (0 min) Graph-structured data arise in many scenarios. A fundamental problem is to quantify the similarities of graphs for tasks such as classification. R-convolution graph kernels are positive-semidefinite functions that decompose graphs into substructures and compare them. One problem in the effective implementation of this idea is that the substructures are not independent, which leads to high-dimensional feature space. In addition, graph kernels cannot capture the high-order complex interactions between vertices. To mitigate these two problems, we propose a framework called DeepMap to learn deep representations for graph feature maps. The learned deep representation for a graph is a dense and low-dimensional vector that captures complex high-order interactions in a vertex neighborhood. DeepMap extends Convolutional Neural Networks (CNNs) to arbitrary graphs by generating aligned vertex sequences and building the receptive field for each vertex. We empirically validate DeepMap on various graph classification benchmarks and demonstrate that it achieves state-of-the-art performance.
    Understanding Instance-based Interpretability of Variational Auto-Encoders. (arXiv:2105.14203v4 [cs.LG] UPDATED)
    (0 min) Instance-based interpretation methods have been widely studied for supervised learning methods as they help explain how black box neural networks predict. However, instance-based interpretations remain ill-understood in the context of unsupervised learning. In this paper, we investigate influence functions [Koh and Liang, 2017], a popular instance-based interpretation method, for a class of deep generative models called variational auto-encoders (VAE). We formally frame the counter-factual question answered by influence functions in this setting, and through theoretical analysis, examine what they reveal about the impact of training samples on classical unsupervised learning methods. We then introduce VAE- TracIn, a computationally efficient and theoretically sound solution based on Pruthi et al. [2020], for VAEs. Finally, we evaluate VAE-TracIn on several real world datasets with extensive quantitative and qualitative analysis.
    Learning from Sparse Demonstrations. (arXiv:2008.02159v2 [cs.RO] UPDATED)
    (0 min) This paper develops the Continuous Pontryagin Differentiable Programming (Continuous PDP) method that enables a robot to learn a control utility function from a few number of sparsely demonstrated keyframes. The keyframes are few desired sequential outputs that a robot is wanted to follow at certain time instances. The duration of the keyframes may be different from that of the robot actual execution. The method jointly searches for a robot control utility function and a time-warping function such that the robot motion sequentially follows the given keyframes with minimal discrepancy loss. Continuous PDP minimizes the discrepancy loss using projected gradient descent, by efficiently solving the gradient of robot motion with respect to the unknown parameters. The method is first evaluated on a simulated two-link robot arm, and then applied to a 6-DoF maneuvering quadrotor to learn a utility function from keyframes for its motion planning in un-modeled environments with obstacles. The results show the efficiency of the method, its ability to handle time misalignment between keyframes and robot execution, and the generalization of the learned utility function into unseen motion conditions.
    Enhanced spatio-temporal electric load forecasts using less data with active deep learning. (arXiv:2012.04407v2 [cs.LG] UPDATED)
    (0 min) An effective way to oppose global warming and mitigate climate change is to electrify our energy sectors and supply their electric power from renewable wind and solar. Spatio-temporal predictions of electric load become increasingly important for planning this transition, while deep learning prediction models provide increasingly accurate predictions for it. The data used for training deep learning models, however, is usually collected at random using a passive learning approach. This naturally results in a large demand for data and associated costs for sensors like smart meters, posing a large barrier for electric utilities in decarbonizing their grids. Here, we test active learning where we leverage additional computation for collecting a more informative subset of data. We show how electric utilities can apply active learning to better distribute smart meters and collect their data for more accurate predictions of load with about half the data compared to when applying passive learning.
    Speaker Diarization with LSTM. (arXiv:1710.10468v7 [eess.AS] UPDATED)
    (0 min) For many years, i-vector based audio embedding techniques were the dominant approach for speaker verification and speaker diarization applications. However, mirroring the rise of deep learning in various domains, neural network based audio embeddings, also known as d-vectors, have consistently demonstrated superior speaker verification performance. In this paper, we build on the success of d-vector based speaker verification systems to develop a new d-vector based approach to speaker diarization. Specifically, we combine LSTM-based d-vector audio embeddings with recent work in non-parametric clustering to obtain a state-of-the-art speaker diarization system. Our system is evaluated on three standard public datasets, suggesting that d-vector based diarization systems offer significant advantages over traditional i-vector based systems. We achieved a 12.0% diarization error rate on NIST SRE 2000 CALLHOME, while our model is trained with out-of-domain data from voice search logs.
    Imposing Connectome-Derived Topology on an Echo State Network. (arXiv:2201.09359v1 [cs.NE])
    (0 min) Can connectome-derived constraints inform computation? In this paper we investigate the contribution of a fruit fly connectome's topology on the performance of an Echo State Network (ESN) -- a subset of Reservoir Computing which is state of the art in chaotic time series prediction. Specifically, we replace the reservoir layer of a classical ESN -- normally a fixed, random graph represented as a 2-d matrix -- with a particular (female) fruit fly connectome-derived connectivity matrix. We refer to this experimental class of models (with connectome-derived reservoirs) as "Fruit Fly ESNs" (FFESNs). We train and validate the FFESN on a chaotic time series prediction task; here we consider four sets of trials with different training input sizes (small, large) and train-validate splits (two variants). We compare the validation performance (Mean-Squared Error) of all of the best FFESN models to a class of control model ESNs (simply referred to as "ESNs"). Overall, for all four sets of trials we find that the FFESN either significantly outperforms (and has lower variance than) the ESN; or simply has lower variance than the ESN.
    Synthetic Data -- Anonymisation Groundhog Day. (arXiv:2011.07018v6 [cs.LG] UPDATED)
    (0 min) Synthetic data has been advertised as a silver-bullet solution to privacy-preserving data publishing that addresses the shortcomings of traditional anonymisation techniques. The promise is that synthetic data drawn from generative models preserves the statistical properties of the original dataset but, at the same time, provides perfect protection against privacy attacks. In this work, we present the first quantitative evaluation of the privacy gain of synthetic data publishing and compare it to that of previous anonymisation techniques. Our evaluation of a wide range of state-of-the-art generative models demonstrates that synthetic data either does not prevent inference attacks or does not retain data utility. In other words, we empirically show that synthetic data does not provide a better tradeoff between privacy and utility than traditional anonymisation techniques. Furthermore, in contrast to traditional anonymisation, the privacy-utility tradeoff of synthetic data publishing is hard to predict. Because it is impossible to predict what signals a synthetic dataset will preserve and what information will be lost, synthetic data leads to a highly variable privacy gain and unpredictable utility loss. In summary, we find that synthetic data is far from the holy grail of privacy-preserving data publishing.
    Tensor Completion by Multi-Rank via Unitary Transformation. (arXiv:2012.08784v2 [stat.ML] UPDATED)
    (0 min) One of the key problems in tensor completion is the number of uniformly random sample entries required for recovery guarantee. The main aim of this paper is to study $n_1 \times n_2 \times n_3$ third-order tensor completion based on transformed tensor singular value decomposition, and provide a bound on the number of required sample entries. Our approach is to make use of the multi-rank of the underlying tensor instead of its tubal rank in the bound. In numerical experiments on synthetic and imaging data sets, we demonstrate the effectiveness of our proposed bound for the number of sample entries. Moreover, our theoretical results are valid to any unitary transformation applied to $n_3$-dimension under transformed tensor singular value decomposition.
    Tensor Ring Parametrized Variational Quantum Circuits for Large Scale Quantum Machine Learning. (arXiv:2201.08878v1 [quant-ph])
    (2 min) Quantum Machine Learning (QML) is an emerging research area advocating the use of quantum computing for advancement in machine learning. Since the discovery of the capability of Parametrized Variational Quantum Circuits (VQC) to replace Artificial Neural Networks, they have been widely adopted to different tasks in Quantum Machine Learning. However, despite their potential to outperform neural networks, VQCs are limited to small scale applications given the challenges in scalability of quantum circuits. To address this shortcoming, we propose an algorithm that compresses the quantum state within the circuit using a tensor ring representation. Using the input qubit state in the tensor ring representation, single qubit gates maintain the tensor ring representation. However, the same is not true for two qubit gates in general, where an approximation is used to have the output as a tensor ring representation. Using this approximation, the storage and computational time increases linearly in the number of qubits and number of layers, as compared to the exponential increase with exact simulation algorithms. This approximation is used to implement the tensor ring VQC. The training of the parameters of tensor ring VQC is performed using a gradient descent based algorithm, where efficient approaches for backpropagation are used. The proposed approach is evaluated on two datasets: Iris and MNIST for the classification task to show the improved accuracy using more number of qubits. We achieve a test accuracy of 83.33\% on Iris dataset and a maximum of 99.30\% and 76.31\% on binary and ternary classification of MNIST dataset using various circuit architectures. The results from the IRIS dataset outperform the results on VQC implemented on Qiskit, and being scalable, demonstrates the potential for VQCs to be used for large scale Quantum Machine Learning applications.
    Universal Online Learning with Unbounded Losses: Memory Is All You Need. (arXiv:2201.08903v1 [stat.ML])
    (2 min) We resolve an open problem of Hanneke on the subject of universally consistent online learning with non-i.i.d. processes and unbounded losses. The notion of an optimistically universal learning rule was defined by Hanneke in an effort to study learning theory under minimal assumptions. A given learning rule is said to be optimistically universal if it achieves a low long-run average loss whenever the data generating process makes this goal achievable by some learning rule. Hanneke posed as an open problem whether, for every unbounded loss, the family of processes admitting universal learning are precisely those having a finite number of distinct values almost surely. In this paper, we completely resolve this problem, showing that this is indeed the case. As a consequence, this also offers a dramatically simpler formulation of an optimistically universal learning rule for any unbounded loss: namely, the simple memorization rule already suffices. Our proof relies on constructing random measurable partitions of the instance space and could be of independent interest for solving other open questions. We extend the results to the non-realizable setting thereby providing an optimistically universal Bayes consistent learning rule.
    Modality Bank: Learn multi-modality images across data centers without sharing medical data. (arXiv:2201.08955v1 [eess.IV])
    (2 min) Multi-modality images have been widely used and provide comprehensive information for medical image analysis. However, acquiring all modalities among all institutes is costly and often impossible in clinical settings. To leverage more comprehensive multi-modality information, we propose a privacy secured decentralized multi-modality adaptive learning architecture named ModalityBank. Our method could learn a set of effective domain-specific modulation parameters plugged into a common domain-agnostic network. We demonstrate by switching different sets of configurations, the generator could output high-quality images for a specific modality. Our method could also complete the missing modalities across all data centers, thus could be used for modality completion purposes. The downstream task trained from the synthesized multi-modality samples could achieve higher performance than learning from one real data center and achieve close-to-real performance compare with all real images.
    An integrated recurrent neural network and regression model with spatial and climatic couplings for vector-borne disease dynamics. (arXiv:2201.09394v1 [cs.LG])
    (0 min) We developed an integrated recurrent neural network and nonlinear regression spatio-temporal model for vector-borne disease evolution. We take into account climate data and seasonality as external factors that correlate with disease transmitting insects (e.g. flies), also spill-over infections from neighboring regions surrounding a region of interest. The climate data is encoded to the model through a quadratic embedding scheme motivated by recommendation systems. The neighboring regions' influence is modeled by a long short-term memory neural network. The integrated model is trained by stochastic gradient descent and tested on leish-maniasis data in Sri Lanka from 2013-2018 where infection outbreaks occurred. Our model outperformed ARIMA models across a number of regions with high infections, and an associated ablation study renders support to our modeling hypothesis and ideas.
    Spatio-Temporal Federated Learning for Massive Wireless Edge Networks. (arXiv:2110.14578v2 [cs.LG] UPDATED)
    (0 min) This paper presents a novel approach to conduct highly efficient federated learning (FL) over a massive wireless edge network, where an edge server and numerous mobile devices (clients) jointly learn a global model without transporting the huge amount of data collected by the mobile devices to the edge server. The proposed FL approach is referred to as spatio-temporal FL (STFL), which jointly exploits the spatial and temporal correlations between the learning updates from different mobile devices scheduled to join STFL in various training epochs. The STFL model not only represents the realistic intermittent learning behavior from the edge server to the mobile devices due to data delivery outage, but also features a mechanism of compensating loss learning updates in order to mitigate the impacts of intermittent learning. An analytical framework of STFL is proposed and employed to study the learning capability of STFL via its convergence performance. In particular, we have assessed the impact of data delivery outage, intermittent learning mitigation, and statistical heterogeneity of datasets on the convergence performance of STFL. The results provide crucial insights into the design and analysis of STFL-based wireless networks.
    Approximation bounds for norm constrained neural networks with applications to regression and GANs. (arXiv:2201.09418v1 [cs.LG])
    (0 min) This paper studies the approximation capacity of ReLU neural networks with norm constraint on the weights. We prove upper and lower bounds on the approximation error of these networks for smooth function classes. The lower bound is derived through the Rademacher complexity of neural networks, which may be of independent interest. We apply these approximation bounds to analyze the convergence of regression using norm constrained neural networks and distribution estimation by GANs. In particular, we obtain convergence rates for over-parameterized neural networks. It is also shown that GANs can achieve optimal rate of learning probability distributions, when the discriminator is a properly chosen norm constrained neural network.
    Bias in Automated Speaker Recognition. (arXiv:2201.09486v1 [cs.SD])
    (0 min) Automated speaker recognition uses data processing to identify speakers by their voice. Today, automated speaker recognition technologies are deployed on billions of smart devices and in services such as call centres. Despite their wide-scale deployment and known sources of bias in face recognition and natural language processing, bias in automated speaker recognition has not been studied systematically. We present an in-depth empirical and analytical study of bias in the machine learning development workflow of speaker verification, a voice biometric and core task in automated speaker recognition. Drawing on an established framework for understanding sources of harm in machine learning, we show that bias exists at every development stage in the well-known VoxCeleb Speaker Recognition Challenge, including model building, implementation, and data generation. Most affected are female speakers and non-US nationalities, who experience significant performance degradation. Leveraging the insights from our findings, we make practical recommendations for mitigating bias in automated speaker recognition, and outline future research directions.
    PaRT: Parallel Learning Towards Robust and Transparent AI. (arXiv:2201.09534v1 [cs.LG])
    (0 min) This paper takes a parallel learning approach for robust and transparent AI. A deep neural network is trained in parallel on multiple tasks, where each task is trained only on a subset of the network resources. Each subset consists of network segments, that can be combined and shared across specific tasks. Tasks can share resources with other tasks, while having independent task-related network resources. Therefore, the trained network can share similar representations across various tasks, while also enabling independent task-related representations. The above allows for some crucial outcomes. (1) The parallel nature of our approach negates the issue of catastrophic forgetting. (2) The sharing of segments uses network resources more efficiently. (3) We show that the network does indeed use learned knowledge from some tasks in other tasks, through shared representations. (4) Through examination of individual task-related and shared representations, the model offers transparency in the network and in the relationships across tasks in a multi-task setting. Evaluation of the proposed approach against complex competing approaches such as Continual Learning, Neural Architecture Search, and Multi-task learning shows that it is capable of learning robust representations. This is the first effort to train a DL model on multiple tasks in parallel. Our code is available at https://github.com/MahsaPaknezhad/PaRT
    A Novel Framework for Online Supervised Learning with Feature Selection. (arXiv:1803.11521v9 [stat.ML] UPDATED)
    (0 min) Current online learning methods suffer issues such as lower convergence rates and limited capability to recover the support of the true features compared to their offline counterparts. In this paper, we present a novel framework for online learning based on running averages and introduce a series of online versions of popular offline methods such as Elastic Net, Minimax Concave Penalty, and Feature Selection with Annealing. The framework can handle an arbitrarily large number of observations with the restriction that the data dimension is not too large, e.g. p<50,000. We prove the equivalence between our online methods and their offline counterparts and give theoretical true feature recovery and convergence guarantees for some of them. In contrast to existing online methods, the proposed methods can extract models with any desired sparsity level at any time. Numerical experiments indicate that our new methods enjoy high true feature recovery accuracy and a fast convergence rate, compared with standard online and offline algorithms. We also show how the running averages framework can be used for model adaptation in the presence of model drift. Finally, we present applications to large datasets where again the proposed framework shows competitive results compared to popular online and offline algorithms.
    Exploring Autoencoder-based Error-bounded Compression for Scientific Data. (arXiv:2105.11730v5 [cs.LG] UPDATED)
    (0 min) Error-bounded lossy compression is becoming an indispensable technique for the success of today's scientific projects with vast volumes of data produced during the simulations or instrument data acquisitions. Not only can it significantly reduce data size, but it also can control the compression errors based on user-specified error bounds. Autoencoder (AE) models have been widely used in image compression, but few AE-based compression approaches support error-bounding features, which are highly required by scientific applications. To address this issue, we explore using convolutional autoencoders to improve error-bounded lossy compression for scientific data, with the following three key contributions. (1) We provide an in-depth investigation of the characteristics of various autoencoder models and develop an error-bounded autoencoder-based framework in terms of the SZ model. (2) We optimize the compression quality for main stages in our designed AE-based error-bounded compression framework, fine-tuning the block sizes and latent sizes and also optimizing the compression efficiency of latent vectors. (3) We evaluate our proposed solution using five real-world scientific datasets and comparing them with six other related works. Experiments show that our solution exhibits a very competitive compression quality from among all the compressors in our tests. In absolute terms, it can obtain a much better compression quality (100% ~ 800% improvement in compression ratio with the same data distortion) compared with SZ2.1 and ZFP in cases with a high compression ratio.
    Investigating Expressiveness of Transformer in Spectral Domain for Graphs. (arXiv:2201.09332v1 [cs.LG])
    (0 min) Transformers have been proven to be inadequate for graph representation learning. To understand this inadequacy, there is need to investigate if spectral analysis of transformer will reveal insights on its expressive power. Similar studies already established that spectral analysis of Graph neural networks (GNNs) provides extra perspectives on their expressiveness. In this work, we systematically study and prove the link between the spatial and spectral domain in the realm of the transformer. We further provide a theoretical analysis that the spatial attention mechanism in the transformer cannot effectively capture the desired frequency response, thus, inherently limiting its expressiveness in spectral space. Therefore, we propose FeTA, a framework that aims to perform attention over the entire graph spectrum analogous to the attention in spatial space. Empirical results suggest that FeTA provides homogeneous performance gain against vanilla transformer across all tasks on standard benchmarks and can easily be extended to GNN based models with low-pass characteristics (e.g., GAT). Furthermore, replacing the vanilla transformer model with FeTA in recently proposed position encoding schemes has resulted in comparable or better performance than transformer and GNN baselines.
    Multi-way Spectral Clustering of Augmented Multi-view Data through Deep Collective Matrix Tri-factorization. (arXiv:2009.05805v2 [cs.LG] UPDATED)
    (0 min) We present the first deep learning based architecture for collective matrix tri-factorization (DCMTF) of arbitrary collections of matrices, also known as augmented multi-view data. DCMTF can be used for multi-way spectral clustering of heterogeneous collections of relational data matrices to discover latent clusters in each input matrix, across both dimensions, as well as the strengths of association across clusters. The source code for DCMTF is available on our public repository: https://bitbucket.org/cdal/dcmtf_generic
    Fast MRI Reconstruction: How Powerful Transformers Are?. (arXiv:2201.09400v1 [eess.IV])
    (0 min) Magnetic resonance imaging (MRI) is a widely used non-radiative and non-invasive method for clinical interrogation of organ structures and metabolism, with an inherently long scanning time. Methods by k-space undersampling and deep learning based reconstruction have been popularised to accelerate the scanning process. This work focuses on investigating how powerful transformers are for fast MRI by exploiting and comparing different novel network architectures. In particular, a generative adversarial network (GAN) based Swin transformer (ST-GAN) was introduced for the fast MRI reconstruction. To further preserve the edge and texture information, edge enhanced GAN based Swin transformer (EESGAN) and texture enhanced GAN based Swin transformer (TES-GAN) were also developed, where a dual-discriminator GAN structure was applied. We compared our proposed GAN based transformers, standalone Swin transformer and other convolutional neural networks based based GAN model in terms of the evaluation metrics PSNR, SSIM and FID. We showed that transformers work well for the MRI reconstruction from different undersampling conditions. The utilisation of GAN's adversarial structure improves the quality of images reconstructed when undersampled for 30% or higher.
    An Infinite-Feature Extension for Bayesian ReLU Nets That Fixes Their Asymptotic Overconfidence. (arXiv:2010.02709v5 [cs.LG] UPDATED)
    (0 min) A Bayesian treatment can mitigate overconfidence in ReLU nets around the training data. But far away from them, ReLU Bayesian neural networks (BNNs) can still underestimate uncertainty and thus be asymptotically overconfident. This issue arises since the output variance of a BNN with finitely many features is quadratic in the distance from the data region. Meanwhile, Bayesian linear models with ReLU features converge, in the infinite-width limit, to a particular Gaussian process (GP) with a variance that grows cubically so that no asymptotic overconfidence can occur. While this may seem of mostly theoretical interest, in this work, we show that it can be used in practice to the benefit of BNNs. We extend finite ReLU BNNs with infinite ReLU features via the GP and show that the resulting model is asymptotically maximally uncertain far away from the data while the BNNs' predictive power is unaffected near the data. Although the resulting model approximates a full GP posterior, thanks to its structure, it can be applied \emph{post-hoc} to any pre-trained ReLU BNN at a low cost.
    A Critical Look at the Consistency of Causal Estimation With Deep Latent Variable Models. (arXiv:2102.06648v5 [cs.LG] UPDATED)
    (0 min) Using deep latent variable models in causal inference has attracted considerable interest recently, but an essential open question is their ability to yield consistent causal estimates. While they have demonstrated promising results and theory exists on some simple model formulations, we also know that causal effects are not even identifiable in general with latent variables. We investigate this gap between theory and empirical results with analytical considerations and extensive experiments under multiple synthetic and real-world data sets, using the causal effect variational autoencoder (CEVAE) as a case study. While CEVAE seems to work reliably under some simple scenarios, it does not estimate the causal effect correctly with a misspecified latent variable or a complex data distribution, as opposed to its original motivation. Hence, our results show that more attention should be paid to ensuring the correctness of causal estimates with deep latent variable models.
    Spectral, Probabilistic, and Deep Metric Learning: Tutorial and Survey. (arXiv:2201.09267v1 [stat.ML])
    (0 min) This is a tutorial and survey paper on metric learning. Algorithms are divided into spectral, probabilistic, and deep metric learning. We first start with the definition of distance metric, Mahalanobis distance, and generalized Mahalanobis distance. In spectral methods, we start with methods using scatters of data, including the first spectral metric learning, relevant methods to Fisher discriminant analysis, Relevant Component Analysis (RCA), Discriminant Component Analysis (DCA), and the Fisher-HSIC method. Then, large-margin metric learning, imbalanced metric learning, locally linear metric adaptation, and adversarial metric learning are covered. We also explain several kernel spectral methods for metric learning in the feature space. We also introduce geometric metric learning methods on the Riemannian manifolds. In probabilistic methods, we start with collapsing classes in both input and feature spaces and then explain the neighborhood component analysis methods, Bayesian metric learning, information theoretic methods, and empirical risk minimization in metric learning. In deep learning methods, we first introduce reconstruction autoencoders and supervised loss functions for metric learning. Then, Siamese networks and its various loss functions, triplet mining, and triplet sampling are explained. Deep discriminant analysis methods, based on Fisher discriminant analysis, are also reviewed. Finally, we introduce multi-modal deep metric learning, geometric metric learning by neural networks, and few-shot metric learning.
    Multiple Similarity Drug-Target Interaction Prediction with Random Walks and Matrix Factorization. (arXiv:2201.09508v1 [q-bio.QM])
    (0 min) The discovery of drug-target interactions (DTIs) is a very promising area of research with great potential. In general, the identification of reliable interactions among drugs and proteins can boost the development of effective pharmaceuticals. In this work, we leverage random walks and matrix factorization techniques towards DTI prediction. In particular, we take a multi-layered network perspective, where different layers correspond to different similarity metrics between drugs and targets. To fully take advantage of topology information captured in multiple views, we develop an optimization framework, called MDMF, for DTI prediction. The framework learns vector representations of drugs and targets that not only retain higher-order proximity across all hyper-layers and layer-specific local invariance, but also approximates the interactions with their inner product. Furthermore, we propose an ensemble method, called MDMF2A, which integrates two instantiations of the MDMF model that optimize surrogate losses of the area under the precision-recall curve (AUPR) and the area under the receiver operating characteristic curve (AUC), respectively. The empirical study on real-world DTI datasets shows that our method achieves significant improvement over current state-of-the-art approaches in four different settings. Moreover, the validation of highly ranked non-interacting pairs also demonstrates the potential of MDMF2A to discover novel DTIs.
    Scattering Networks on the Sphere for Scalable and Rotationally Equivariant Spherical CNNs. (arXiv:2102.02828v4 [cs.CV] UPDATED)
    (0 min) Convolutional neural networks (CNNs) constructed natively on the sphere have been developed recently and shown to be highly effective for the analysis of spherical data. While an efficient framework has been formulated, spherical CNNs are nevertheless highly computationally demanding; typically they cannot scale beyond spherical signals of thousands of pixels. We develop scattering networks constructed natively on the sphere that provide a powerful representational space for spherical data. Spherical scattering networks are computationally scalable and exhibit rotational equivariance, while their representational space is invariant to isometries and provides efficient and stable signal representations. By integrating scattering networks as an additional type of layer in the generalized spherical CNN framework, we show how they can be leveraged to scale spherical CNNs to the high-resolution data typical of many practical applications, with spherical signals of many tens of megapixels and beyond.
    Image features of a splashing drop on a solid surface extracted using a feedforward neural network. (arXiv:2201.09541v1 [physics.flu-dyn])
    (0 min) This article reports nonintuitive characteristic of a splashing drop on a solid surface discovered through extracting image features using a feedforward neural network (FNN). Ethanol of area-equivalent radius about 1.29 mm was dropped from impact heights ranging from 4 cm to 60 cm (splashing threshold 20 cm) and impacted on a hydrophilic surface. The images captured when half of the drop impacted the surface were labeled according to their outcome, splashing or nonsplashing, and were used to train an FNN. A classification accuracy higher than 96% was achieved. To extract the image features identified by the FNN for classification, the weight matrix of the trained FNN for identifying splashing drops was visualized. Remarkably, the visualization showed that the trained FNN identified the contour height of the main body of the impacting drop as an important characteristic differentiating between splashing and nonsplashing drops, which has not been reported in previous studies. This feature was found throughout the impact, even when one and three-quarters of the drop impacted the surface. To confirm the importance of this image feature, the FNN was retrained to classify using only the main body without checking for the presence of ejected secondary droplets. The accuracy was still higher than 82%, confirming that the contour height is an important feature distinguishing splashing from nonsplashing drops. Several aspects of drop impact are analyzed and discussed with the aim of identifying the possible mechanism underlying the difference in contour height between splashing and nonsplashing drops.
    Lazy FSCA for Unsupervised Variable Selection. (arXiv:2103.02687v2 [cs.LG] UPDATED)
    (0 min) Various unsupervised greedy selection methods have been proposed as computationally tractable approximations to the NP-hard subset selection problem. These methods rely on sequentially selecting the variables that best improve performance with respect to a selection criterion. Theoretical results exist that provide performance bounds and enable "lazy greedy" efficient implementations for selection criteria that satisfy a diminishing returns property known as submodularity. This has motivated the development of variable selection algorithms based on mutual information and frame potential. Recently, the authors introduced Forward Selection Component Analysis (FSCA) which uses variance explained as its selection criterion. While this criterion is not submodular, FSCA has been shown to be highly effective for applications such as measurement plan optimisation. In this paper a "lazy" implementation of the FSCA algorithm (L-FSCA) is proposed, which, although not equivalent to FSCA due to the absence of submodularity, has the potential to yield comparable performance while being up to an order of magnitude faster to compute. The efficacy of L-FSCA is demonstrated by performing a systematic comparison with FSCA and five other unsupervised variable selection methods from the literature using simulated and real-world case studies. Experimental results confirm that L-FSCA yields almost identical performance to FSCA while reducing computation time by between 22% and 94% for the case studies considered.
    Homotopic Policy Mirror Descent: Policy Convergence, Implicit Regularization, and Improved Sample Complexity. (arXiv:2201.09457v1 [cs.LG])
    (0 min) We propose the homotopic policy mirror descent (HPMD) method for solving discounted, infinite horizon MDPs with finite state and action space, and study its policy convergence. We report three properties that seem to be new in the literature of policy gradient methods: (1) The policy first converges linearly, then superlinearly with order $\gamma^{-2}$ to the set of optimal policies, after $\mathcal{O}(\log(1/\Delta^*))$ number of iterations, where $\Delta^*$ is defined via a gap quantity associated with the optimal state-action value function; (2) HPMD also exhibits last-iterate convergence, with the limiting policy corresponding exactly to the optimal policy with the maximal entropy for every state. No regularization is added to the optimization objective and hence the second observation arises solely as an algorithmic property of the homotopic policy gradient method. (3) For the stochastic HPMD method, we demonstrate a better than $\mathcal{O}(|\mathcal{S}| |\mathcal{A}| / \epsilon^2)$ sample complexity for small optimality gap $\epsilon$, when assuming a generative model for policy evaluation.
    RoMA: a Method for Neural Network Robustness Measurement and Assessment. (arXiv:2110.11088v3 [cs.LG] UPDATED)
    (0 min) Neural network models have become the leading solution for a large variety of tasks, such as classification, language processing, protein folding, and others. However, their reliability is heavily plagued by adversarial inputs: small input perturbations that cause the model to produce erroneous outputs. Adversarial inputs can occur naturally when the system's environment behaves randomly, even in the absence of a malicious adversary, and are a severe cause for concern when attempting to deploy neural networks within critical systems. In this paper, we present a new statistical method, called Robustness Measurement and Assessment (RoMA), which can measure the expected robustness of a neural network model. Specifically, RoMA determines the probability that a random input perturbation might cause misclassification. The method allows us to provide formal guarantees regarding the expected frequency of errors that a trained model will encounter after deployment. Our approach can be applied to large-scale, black-box neural networks, which is a significant advantage compared to recently proposed verification methods. We apply our approach in two ways: comparing the robustness of different models, and measuring how a model's robustness is affected by the magnitude of input perturbation. One interesting insight obtained through this work is that, in a classification network, different output labels can exhibit very different robustness levels. We term this phenomenon categorial robustness. Our ability to perform risk and robustness assessments on a categorial basis opens the door to risk mitigation, which may prove to be a significant step towards neural network certification in safety-critical applications.
    Neural Bellman-Ford Networks: A General Graph Neural Network Framework for Link Prediction. (arXiv:2106.06935v4 [cs.LG] UPDATED)
    (0 min) Link prediction is a very fundamental task on graphs. Inspired by traditional path-based methods, in this paper we propose a general and flexible representation learning framework based on paths for link prediction. Specifically, we define the representation of a pair of nodes as the generalized sum of all path representations, with each path representation as the generalized product of the edge representations in the path. Motivated by the Bellman-Ford algorithm for solving the shortest path problem, we show that the proposed path formulation can be efficiently solved by the generalized Bellman-Ford algorithm. To further improve the capacity of the path formulation, we propose the Neural Bellman-Ford Network (NBFNet), a general graph neural network framework that solves the path formulation with learned operators in the generalized Bellman-Ford algorithm. The NBFNet parameterizes the generalized Bellman-Ford algorithm with 3 neural components, namely INDICATOR, MESSAGE and AGGREGATE functions, which corresponds to the boundary condition, multiplication operator, and summation operator respectively. The NBFNet is very general, covers many traditional path-based methods, and can be applied to both homogeneous graphs and multi-relational graphs (e.g., knowledge graphs) in both transductive and inductive settings. Experiments on both homogeneous graphs and knowledge graphs show that the proposed NBFNet outperforms existing methods by a large margin in both transductive and inductive settings, achieving new state-of-the-art results.
    Simplest Streaming Trees. (arXiv:2110.08483v2 [cs.LG] UPDATED)
    (0 min) Decision forests, including random forests and gradient boosting trees, remain the leading machine learning methods for many real-world data problems, specifically on tabular data. However, current standard implementations only operate in batch mode, and therefore cannot incrementally update when more data arrive. Several previous works developed streaming trees and ensembles to overcome this limitation. Nonetheless, we found that those state-of-the-art algorithms suffer from a number of drawbacks, including performing very poorly on some problems and requiring a huge amount of memory on others. We therefore developed the simplest possible extension of decision trees we could think of: given new data, simply update existing trees by continuing to grow them, and replace some old trees with new ones to control the total number of trees. On three standard datasets, we illustrate that our approach, Stream Decision Forest (SDF), does not suffer from either of the aforementioned limitations. In a benchmark suite containing 72 classification problems (the OpenML-CC18 data suite), we illustrate that our approach often performs as well, and sometimes better even, than the batch mode random forest algorithm. Thus, we believe SDFs establish a simple standard for streaming trees and forests that could readily be applied to many real-world problems, including those with distribution drift and continual learning.
    Terra: Imperative-Symbolic Co-Execution of Imperative Deep Learning Programs. (arXiv:2201.09210v1 [cs.LG])
    (0 min) Imperative programming allows users to implement their deep neural networks (DNNs) easily and has become an essential part of recent deep learning (DL) frameworks. Recently, several systems have been proposed to combine the usability of imperative programming with the optimized performance of symbolic graph execution. Such systems convert imperative Python DL programs to optimized symbolic graphs and execute them. However, they cannot fully support the usability of imperative programming. For example, if an imperative DL program contains a Python feature with no corresponding symbolic representation (e.g., third-party library calls or unsupported dynamic control flows) they fail to execute the program. To overcome this limitation, we propose Terra, an imperative-symbolic co-execution system that can handle any imperative DL programs while achieving the optimized performance of symbolic graph execution. To achieve this, Terra builds a symbolic graph by decoupling DL operations from Python features. Then, Terra conducts the imperative execution to support all Python features, while delegating the decoupled operations to the symbolic execution. We evaluated the performance improvement and coverage of Terra with ten imperative DL programs for several DNN architectures. The results show that Terra can speed up the execution of all ten imperative DL programs, whereas AutoGraph, one of the state-of-the-art systems, fails to execute five of them.
    Learning Student-Friendly Teacher Networks for Knowledge Distillation. (arXiv:2102.07650v4 [cs.LG] UPDATED)
    (0 min) We propose a novel knowledge distillation approach to facilitate the transfer of dark knowledge from a teacher to a student. Contrary to most of the existing methods that rely on effective training of student models given pretrained teachers, we aim to learn the teacher models that are friendly to students and, consequently, more appropriate for knowledge transfer. In other words, at the time of optimizing a teacher model, the proposed algorithm learns the student branches jointly to obtain student-friendly representations. Since the main goal of our approach lies in training teacher models and the subsequent knowledge distillation procedure is straightforward, most of the existing knowledge distillation methods can adopt this technique to improve the performance of diverse student models in terms of accuracy and convergence speed. The proposed algorithm demonstrates outstanding accuracy in several well-known knowledge distillation techniques with various combinations of teacher and student models even in the case that their architectures are heterogeneous and there is no prior knowledge about student models at the time of training teacher networks.
    Uncertainty-aware deep learning methods for robust diabetic retinopathy classification. (arXiv:2201.09042v1 [cs.CV])
    (0 min) Automatic classification of diabetic retinopathy from retinal images has been widely studied using deep neural networks with impressive results. However, there is a clinical need for estimation of the uncertainty in the classifications, a shortcoming of modern neural networks. Recently, approximate Bayesian deep learning methods have been proposed for the task but the studies have only considered the binary referable/non-referable diabetic retinopathy classification applied to benchmark datasets. We present novel results by systematically investigating a clinical dataset and a clinically relevant 5-class classification scheme, in addition to benchmark datasets and the binary classification scheme. Moreover, we derive a connection between uncertainty measures and classifier risk, from which we develop a new uncertainty measure. We observe that the previously proposed entropy-based uncertainty measure generalizes to the clinical dataset on the binary classification scheme but not on the 5-class scheme, whereas our new uncertainty measure generalizes to the latter case.
    DeepScalper: A Risk-Aware Deep Reinforcement Learning Framework for Intraday Trading with Micro-level Market Embedding. (arXiv:2201.09058v1 [q-fin.TR])
    (0 min) Reinforcement learning (RL) techniques have shown great success in quantitative investment tasks, such as portfolio management and algorithmic trading. Especially, intraday trading is one of the most profitable and risky tasks because of the intraday behaviors of the financial market that reflect billions of rapidly fluctuating values. However, it is hard to apply existing RL methods to intraday trading due to the following three limitations: 1) overlooking micro-level market information (e.g., limit order book); 2) only focusing on local price fluctuation and failing to capture the overall trend of the whole trading day; 3) neglecting the impact of market risk. To tackle these limitations, we propose DeepScalper, a deep reinforcement learning framework for intraday trading. Specifically, we adopt an encoder-decoder architecture to learn robust market embedding incorporating both macro-level and micro-level market information. Moreover, a novel hindsight reward function is designed to provide the agent a long-term horizon for capturing the overall price trend. In addition, we propose a risk-aware auxiliary task by predicting future volatility, which helps the agent take market risk into consideration while maximizing profit. Finally, extensive experiments on two stock index futures and four treasury bond futures demonstrate that DeepScalper achieves significant improvement against many state-of-the-art approaches.
    CNN-based regularisation for CT image reconstructions. (arXiv:2201.09132v1 [eess.IV])
    (0 min) X-ray computed tomographic infrastructures are medical imaging modalities that rely on the acquisition of rays crossing examined objects while measuring their intensity decrease. Physical measurements are post-processed by mathematical reconstruction algorithms that may offer weaker or top-notch consistency guarantees on the computed volumetric field. Superior results are provided on the account of an abundance of low-noise measurements being supplied. Nonetheless, such a scanning process would expose the examined body to an undesirably large-intensity and long-lasting ionising radiation, imposing severe health risks. One main objective of the ongoing research is the reduction of the number of projections while keeping the quality performance stable. Due to the under-sampling, the noise occurring inherently because of photon-electron interactions is now supplemented by reconstruction artifacts. Recently, deep learning methods, especially fully convolutional networks have been extensively investigated and proven to be efficient in filtering such deviations. In this report algorithms are presented that take as input a slice of a low-quality reconstruction of the volume in question and aim to map it to the reconstruction that is considered ideal, the ground truth. Above that, the first system comprises two additional elements: firstly, it ensures the consistency with the measured sinogram, secondly it adheres to constraints proposed in classical compressive sampling theory. The second one, inspired by classical ways of solving the inverse problem of reconstruction, takes an iterative approach to regularise the hypothesis in the direction of the correct result.
    Differentially Private SGDA for Minimax Problems. (arXiv:2201.09046v1 [cs.LG])
    (0 min) Stochastic gradient descent ascent (SGDA) and its variants have been the workhorse for solving minimax problems. However, in contrast to the well-studied stochastic gradient descent (SGD) with differential privacy (DP) constraints, there is little work on understanding the generalization (utility) of SGDA with DP constraints. In this paper, we use the algorithmic stability approach to establish the generalization (utility) of DP-SGDA in different settings. In particular, for the convex-concave setting, we prove that the DP-SGDA can achieve an optimal utility rate in terms of the weak primal-dual population risk in both smooth and non-smooth cases. To our best knowledge, this is the first-ever-known result for DP-SGDA in the non-smooth case. We further provide its utility analysis in the nonconvex-strongly-concave setting which is the first-ever-known result in terms of the primal population risk. The convergence and generalization results for this nonconvex setting are new even in the non-private setting. Finally, numerical experiments are conducted to demonstrate the effectiveness of DP-SGDA for both convex and nonconvex cases.
    Joint Learning of Hierarchical Community Structure and Node Representations: An Unsupervised Approach. (arXiv:2201.09086v1 [cs.LG])
    (0 min) Graph representation learning has demonstrated improved performance in tasks such as link prediction and node classification across a range of domains. Research has shown that many natural graphs can be organized in hierarchical communities, leading to approaches that use these communities to improve the quality of node representations. However, these approaches do not take advantage of the learned representations to also improve the quality of the discovered communities and establish an iterative and joint optimization of representation learning and community discovery. In this work, we present Mazi, an algorithm that jointly learns the hierarchical community structure and the node representations of the graph in an unsupervised fashion. To account for the structure in the node representations, Mazi generates node representations at each level of the hierarchy, and utilizes them to influence the node representations of the original graph. Further, the communities at each level are discovered by simultaneously maximizing the modularity metric and minimizing the distance between the representations of a node and its community. Using multi-label node classification and link prediction tasks, we evaluate our method on a variety of synthetic and real-world graphs and demonstrate that Mazi outperforms other hierarchical and non-hierarchical methods.
    Data-Centric Machine Learning in Quantum Information Science. (arXiv:2201.09134v1 [quant-ph])
    (0 min) We propose a series of data-centric heuristics for improving the performance of machine learning systems when applied to problems in quantum information science. In particular, we consider how systematic engineering of training sets can significantly enhance the accuracy of pre-trained neural networks used for quantum state reconstruction without altering the underlying architecture. We find that it is not always optimal to engineer training sets to exactly match the expected distribution of a target scenario, and instead, performance can be further improved by biasing the training set to be slightly more mixed than the target. This is due to the heterogeneity in the number of free variables required to describe states of different purity, and as a result, overall accuracy of the network improves when training sets of a fixed size focus on states with the least constrained free variables. For further clarity, we also include a "toy model" demonstration of how spurious correlations can inadvertently enter synthetic data sets used for training, how the performance of systems trained with these correlations can degrade dramatically, and how the inclusion of even relatively few counterexamples can effectively remedy such problems.
    Dual-Flattening Transformers through Decomposed Row and Column Queries for Semantic Segmentation. (arXiv:2201.09139v1 [cs.CV])
    (0 min) It is critical to obtain high resolution features with long range dependency for dense prediction tasks such as semantic segmentation. To generate high-resolution output of size $H\times W$ from a low-resolution feature map of size $h\times w$ ($hw\ll HW$), a naive dense transformer incurs an intractable complexity of $\mathcal{O}(hwHW)$, limiting its application on high-resolution dense prediction. We propose a Dual-Flattening Transformer (DFlatFormer) to enable high-resolution output by reducing complexity to $\mathcal{O}(hw(H+W))$ that is multiple orders of magnitude smaller than the naive dense transformer. Decomposed queries are presented to retrieve row and column attentions tractably through separate transformers, and their outputs are combined to form a dense feature map at high resolution. To this end, the input sequence fed from an encoder is row-wise and column-wise flattened to align with decomposed queries by preserving their row and column structures, respectively. Row and column transformers also interact with each other to capture their mutual attentions with the spatial crossings between rows and columns. We also propose to perform attentions through efficient grouping and pooling to further reduce the model complexity. Extensive experiments on ADE20K and Cityscapes datasets demonstrate the superiority of the proposed dual-flattening transformer architecture with higher mIoUs.
    Revisiting Pooling through the Lens of Optimal Transport. (arXiv:2201.09191v1 [cs.LG])
    (0 min) Pooling is one of the most significant operations in many machine learning models and tasks, whose implementation, however, is often empirical in practice. In this paper, we develop a novel and solid algorithmic pooling framework through the lens of optimal transport. In particular, we demonstrate that most existing pooling methods are equivalent to solving some specializations of an unbalanced optimal transport (UOT) problem. Making the parameters of the UOT problem learnable, we unify most existing pooling methods in the same framework, and accordingly, propose a generalized pooling layer called \textit{UOT-Pooling} for neural networks. Moreover, we implement the UOT-Pooling with two different architectures, based on the Sinkhorn scaling algorithm and the Bregman ADMM algorithm, respectively, and study their stability and efficiency quantitatively. We test our UOT-Pooling layers in two application scenarios, including multi-instance learning (MIL) and graph embedding. For state-of-the-art models of these two tasks, we can improve their performance by replacing conventional pooling layers with our UOT-Pooling layers.
    Increasing the Cost of Model Extraction with Calibrated Proof of Work. (arXiv:2201.09243v1 [cs.CR])
    (0 min) In model extraction attacks, adversaries can steal a machine learning model exposed via a public API by repeatedly querying it and adjusting their own model based on obtained predictions. To prevent model stealing, existing defenses focus on detecting malicious queries, truncating, or distorting outputs, thus necessarily introducing a tradeoff between robustness and model utility for legitimate users. Instead, we propose to impede model extraction by requiring users to complete a proof-of-work before they can read the model's predictions. This deters attackers by greatly increasing (even up to 100x) the computational effort needed to leverage query access for model extraction. Since we calibrate the effort required to complete the proof-of-work to each query, this only introduces a slight overhead for regular users (up to 2x). To achieve this, our calibration applies tools from differential privacy to measure the information revealed by a query. Our method requires no modification of the victim model and can be applied by machine learning practitioners to guard their publicly exposed models against being easily stolen.
    Optimal Dynamic Regret in Proper Online Learning with Strongly Convex Losses and Beyond. (arXiv:2201.08905v1 [cs.LG])
    (0 min) We study the framework of universal dynamic regret minimization with strongly convex losses. We answer an open problem in Baby and Wang 2021 by showing that in a proper learning setup, Strongly Adaptive algorithms can achieve the near optimal dynamic regret of $\tilde O(d^{1/3} n^{1/3}\text{TV}[u_{1:n}]^{2/3} \vee d)$ against any comparator sequence $u_1,\ldots,u_n$ simultaneously, where $n$ is the time horizon and $\text{TV}[u_{1:n}]$ is the Total Variation of comparator. These results are facilitated by exploiting a number of new structures imposed by the KKT conditions that were not considered in Baby and Wang 2021 which also lead to other improvements over their results such as: (a) handling non-smooth losses and (b) improving the dimension dependence on regret. Further, we also derive near optimal dynamic regret rates for the special case of proper online learning with exp-concave losses and an $L_\infty$ constrained decision set.
    Weight Expansion: A New Perspective on Dropout and Generalization. (arXiv:2201.09209v1 [cs.LG])
    (0 min) While dropout is known to be a successful regularization technique, insights into the mechanisms that lead to this success are still lacking. We introduce the concept of \emph{weight expansion}, an increase in the signed volume of a parallelotope spanned by the column or row vectors of the weight covariance matrix, and show that weight expansion is an effective means of increasing the generalization in a PAC-Bayesian setting. We provide a theoretical argument that dropout leads to weight expansion and extensive empirical support for the correlation between dropout and weight expansion. To support our hypothesis that weight expansion can be regarded as an \emph{indicator} of the enhanced generalization capability endowed by dropout, and not just as a mere by-product, we have studied other methods that achieve weight expansion (resp.\ contraction), and found that they generally lead to an increased (resp.\ decreased) generalization ability. This suggests that dropout is an attractive regularizer, because it is a computationally cheap method for obtaining weight expansion. This insight justifies the role of dropout as a regularizer, while paving the way for identifying regularizers that promise improved generalization through weight expansion.
    A Generalized Weighted Optimization Method for Computational Learning and Inversion. (arXiv:2201.09223v1 [cs.LG])
    (0 min) The generalization capacity of various machine learning models exhibits different phenomena in the under- and over-parameterized regimes. In this paper, we focus on regression models such as feature regression and kernel regression and analyze a generalized weighted least-squares optimization method for computational learning and inversion with noisy data. The highlight of the proposed framework is that we allow weighting in both the parameter space and the data space. The weighting scheme encodes both a priori knowledge on the object to be learned and a strategy to weight the contribution of different data points in the loss function. Here, we characterize the impact of the weighting scheme on the generalization error of the learning method, where we derive explicit generalization errors for the random Fourier feature model in both the under- and over-parameterized regimes. For more general feature maps, error bounds are provided based on the singular values of the feature matrix. We demonstrate that appropriate weighting from prior knowledge can improve the generalization capability of the learned model.
    Survey and Systematization of 3D Object Detection Models and Methods. (arXiv:2201.09354v1 [cs.CV])
    (0 min) This paper offers a comprehensive survey of recent developments in 3D object detection covering the full pipeline from input data, over data representation and feature extraction to the actual detection modules. We include basic concepts, focus our survey on a broad spectrum of different approaches arising in the last ten years and propose a systematization which offers a practical framework to compare those approaches on the methods level.
    Towards Sustainable Deep Learning for Wireless Fingerprinting Localization. (arXiv:2201.09071v1 [cs.LG])
    (0 min) Location based services, already popular with end users, are now inevitably becoming part of new wireless infrastructures and emerging business processes. The increasingly popular Deep Learning (DL) artificial intelligence methods perform very well in wireless fingerprinting localization based on extensive indoor radio measurement data. However, with the increasing complexity these methods become computationally very intensive and energy hungry, both for their training and subsequent operation. Considering only mobile users, estimated to exceed 7.4billion by the end of 2025, and assuming that the networks serving these users will need to perform only one localization per user per hour on average, the machine learning models used for the calculation would need to perform 65*10^12 predictions per year. Add to this equation tens of billions of other connected devices and applications that rely heavily on more frequent location updates, and it becomes apparent that localization will contribute significantly to carbon emissions unless more energy-efficient models are developed and used. This motivated our work on a new DL-based architecture for indoor localization that is more energy efficient compared to related state-of-the-art approaches while showing only marginal performance degradation. A detailed performance evaluation shows that the proposed model producesonly 58 % of the carbon footprint while maintaining 98.7 % of the overall performance compared to state of the art model external to our group. Additionally, we elaborate on a methodology to calculate the complexity of the DL model and thus the CO2 footprint during its training and operation.
    Distributed Learning of Generalized Linear Causal Networks. (arXiv:2201.09194v1 [stat.ME])
    (0 min) We consider the task of learning causal structures from data stored on multiple machines, and propose a novel structure learning method called distributed annealing on regularized likelihood score (DARLS) to solve this problem. We model causal structures by a directed acyclic graph that is parameterized with generalized linear models, so that our method is applicable to various types of data. To obtain a high-scoring causal graph, DARLS simulates an annealing process to search over the space of topological sorts, where the optimal graphical structure compatible with a sort is found by a distributed optimization method. This distributed optimization relies on multiple rounds of communication between local and central machines to estimate the optimal structure. We establish its convergence to a global optimizer of the overall score that is computed on all data across local machines. To the best of our knowledge, DARLS is the first distributed method for learning causal graphs with such theoretical guarantees. Through extensive simulation studies, DARLS has shown competing performance against existing methods on distributed data, and achieved comparable structure learning accuracy and test-data likelihood with competing methods applied to pooled data across all local machines. In a real-world application for modeling protein-DNA binding networks with distributed ChIP-Sequencing data, DARLS also exhibits higher predictive power than other methods, demonstrating a great advantage in estimating causal networks from distributed data.
    SpiroMask: Measuring Lung Function Using Consumer-Grade Masks. (arXiv:2201.09280v1 [cs.HC])
    (0 min) According to the World Health Organisation (WHO), 235 million people suffer from respiratory illnesses and four million deaths annually. Regular lung health monitoring can lead to prognoses about deteriorating lung health conditions. This paper presents our system SpiroMask that retrofits a microphone in consumer-grade masks (N95 and cloth masks) for continuous lung health monitoring. We evaluate our approach on 48 participants (including 14 with lung health issues) and find that we can estimate parameters such as lung volume and respiration rate within the approved error range by the American Thoracic Society (ATS). Further, we show that our approach is robust to sensor placement inside the mask.
    Bag of Tricks for Natural Policy Gradient Reinforcement Learning. (arXiv:2201.09104v1 [cs.LG])
    (0 min) Natural policy gradient methods are popular reinforcement learning methods that improve the stability of policy gradient methods by preconditioning the gradient with the inverse of the Fisher-information matrix. However, leveraging natural policy gradient methods in an optimal manner can be very challenging as many implementation details must be set to achieve optimal performance. To the best of the authors' knowledge, there has not been a study that has investigated strategies for setting these details for natural policy gradient methods to achieve high performance in a comprehensive and systematic manner. To address this, we have implemented and compared strategies that impact performance in natural policy gradient reinforcement learning across five different second-order approximations. These include varying batch sizes and optimizing the critic network using the natural gradient. Furthermore, insights about the fundamental trade-offs when optimizing for performance (stability, sample efficiency, and computation time) were generated. Experimental results indicate that the proposed collection of strategies for performance optimization can improve results by 86% to 181% across the MuJuCo control benchmark, with TENGraD exhibiting the best approximation performance amongst the tested approximations. Code in this study is available at https://github.com/gebob19/natural-policy-gradient-reinforcement-learning.
    Question rewriting? Assessing its importance for conversational question answering. (arXiv:2201.09146v1 [cs.CL])
    (0 min) In conversational question answering, systems must correctly interpret the interconnected interactions and generate knowledgeable answers, which may require the retrieval of relevant information from a background repository. Recent approaches to this problem leverage neural language models, although different alternatives can be considered in terms of modules for (a) representing user questions in context, (b) retrieving the relevant background information, and (c) generating the answer. This work presents a conversational question answering system designed specifically for the Search-Oriented Conversational AI (SCAI) shared task, and reports on a detailed analysis of its question rewriting module. In particular, we considered different variations of the question rewriting module to evaluate the influence on the subsequent components, and performed a careful analysis of the results obtained with the best system configuration. Our system achieved the best performance in the shared task and our analysis emphasizes the importance of the conversation context representation for the overall system performance.
    pvCNN: Privacy-Preserving and Verifiable Convolutional Neural Network Testing. (arXiv:2201.09186v1 [cs.CR])
    (0 min) This paper proposes a new approach for privacy-preserving and verifiable convolutional neural network (CNN) testing, enabling a CNN model developer to convince a user of the truthful CNN performance over non-public data from multiple testers, while respecting model privacy. To balance the security and efficiency issues, three new efforts are done by appropriately integrating homomorphic encryption (HE) and zero-knowledge succinct non-interactive argument of knowledge (zk-SNARK) primitives with the CNN testing. First, a CNN model to be tested is strategically partitioned into a private part kept locally by the model developer, and a public part outsourced to an outside server. Then, the private part runs over HE-protected test data sent by a tester and transmits its outputs to the public part for accomplishing subsequent computations of the CNN testing. Second, the correctness of the above CNN testing is enforced by generating zk-SNARK based proofs, with an emphasis on optimizing proving overhead for two-dimensional (2-D) convolution operations, since the operations dominate the performance bottleneck during generating proofs. We specifically present a new quadratic matrix programs (QMPs)-based arithmetic circuit with a single multiplication gate for expressing 2-D convolution operations between multiple filters and inputs in a batch manner. Third, we aggregate multiple proofs with respect to a same CNN model but different testers' test data (i.e., different statements) into one proof, and ensure that the validity of the aggregated proof implies the validity of the original multiple proofs. Lastly, our experimental results demonstrate that our QMPs-based zk-SNARK performs nearly 13.9$\times$faster than the existing QAPs-based zk-SNARK in proving time, and 17.6$\times$faster in Setup time, for high-dimension matrix multiplication.
    Differential Geometry in Neural Implicits. (arXiv:2201.09263v1 [cs.GR])
    (0 min) We introduce a neural implicit framework that bridges discrete differential geometry of triangle meshes and continuous differential geometry of neural implicit surfaces. It exploits the differentiable properties of neural networks and the discrete geometry of triangle meshes to approximate them as the zero-level sets of neural implicit functions. To train a neural implicit function, we propose a loss function that allows terms with high-order derivatives, such as the alignment between the principal directions, to learn more geometric details. During training, we consider a non-uniform sampling strategy based on the discrete curvatures of the triangle mesh to access points with more geometric details. This sampling implies faster learning while preserving geometric accuracy. We present the analytical differential geometry formulas for neural surfaces, such as normal vectors and curvatures. We use them to render the surfaces using sphere tracing. Additionally, we propose a network optimization based on singular value decomposition to reduce the number of parameters.
    Visual Representation Learning with Self-Supervised Attention for Low-Label High-data Regime. (arXiv:2201.08951v1 [cs.CV])
    (0 min) Self-supervision has shown outstanding results for natural language processing, and more recently, for image recognition. Simultaneously, vision transformers and its variants have emerged as a promising and scalable alternative to convolutions on various computer vision tasks. In this paper, we are the first to question if self-supervised vision transformers (SSL-ViTs) can be adapted to two important computer vision tasks in the low-label, high-data regime: few-shot image classification and zero-shot image retrieval. The motivation is to reduce the number of manual annotations required to train a visual embedder, and to produce generalizable, semantically meaningful and robust embeddings. For few-shot image classification we train SSL-ViTs without any supervision, on external data, and use this trained embedder to adapt quickly to novel classes with limited number of labels. For zero-shot image retrieval, we use SSL-ViTs pre-trained on a large dataset without any labels and fine-tune them with several metric learning objectives. Our self-supervised attention representations outperforms the state-of-the-art on several public benchmarks for both tasks, namely miniImageNet and CUB200 for few-shot image classification by up-to 6%-10%, and Stanford Online Products, Cars196 and CUB200 for zero-shot image retrieval by up-to 4%-11%. Code is available at \url{https://github.com/AutoVision-cloud/SSL-ViT-lowlabel-highdata}.
    A Causal Lens for Controllable Text Generation. (arXiv:2201.09119v1 [cs.CL])
    (0 min) Controllable text generation concerns two fundamental tasks of wide applications, namely generating text of given attributes (i.e., attribute-conditional generation), and minimally editing existing text to possess desired attributes (i.e., text attribute transfer). Extensive prior work has largely studied the two problems separately, and developed different conditional models which, however, are prone to producing biased text (e.g., various gender stereotypes). This paper proposes to formulate controllable text generation from a principled causal perspective which models the two tasks with a unified framework. A direct advantage of the causal formulation is the use of rich causality tools to mitigate generation biases and improve control. We treat the two tasks as interventional and counterfactual causal inference based on a structural causal model, respectively. We then apply the framework to the challenging practical setting where confounding factors (that induce spurious correlations) are observable only on a small fraction of data. Experiments show significant superiority of the causal approach over previous conditional models for improved control accuracy and reduced bias.
    Sketch2PQ: Freeform Planar Quadrilateral Mesh Design via a Single Sketch. (arXiv:2201.09367v1 [cs.GR])
    (0 min) The freeform architectural modeling process often involves two important stages: concept design and digital modeling. In the first stage, architects usually sketch the overall 3D shape and the panel layout on a physical or digital paper briefly. In the second stage, a digital 3D model is created using the sketching as the reference. The digital model needs to incorporate geometric requirements for its components, such as planarity of panels due to consideration of construction costs, which can make the modeling process more challenging. In this work, we present a novel sketch-based system to bridge the concept design and digital modeling of freeform roof-like shapes represented as planar quadrilateral (PQ) meshes. Our system allows the user to sketch the surface boundary and contour lines under axonometric projection and supports the sketching of occluded regions. In addition, the user can sketch feature lines to provide directional guidance to the PQ mesh layout. Given the 2D sketch input, we propose a deep neural network to infer in real-time the underlying surface shape along with a dense conjugate direction field, both of which are used to extract the final PQ mesh. To train and validate our network, we generate a large synthetic dataset that mimics architect sketching of freeform quadrilateral patches. The effectiveness and usability of our system are demonstrated with quantitative and qualitative evaluation as well as user studies.
    Are Your Sensitive Attributes Private? Novel Model Inversion Attribute Inference Attacks on Classification Models. (arXiv:2201.09370v1 [cs.CR])
    (0 min) Increasing use of machine learning (ML) technologies in privacy-sensitive domains such as medical diagnoses, lifestyle predictions, and business decisions highlights the need to better understand if these ML technologies are introducing leakage of sensitive and proprietary training data. In this paper, we focus on model inversion attacks where the adversary knows non-sensitive attributes about records in the training data and aims to infer the value of a sensitive attribute unknown to the adversary, using only black-box access to the target classification model. We first devise a novel confidence score-based model inversion attribute inference attack that significantly outperforms the state-of-the-art. We then introduce a label-only model inversion attack that relies only on the model's predicted labels but still matches our confidence score-based attack in terms of attack effectiveness. We also extend our attacks to the scenario where some of the other (non-sensitive) attributes of a target record are unknown to the adversary. We evaluate our attacks on two types of machine learning models, decision tree and deep neural network, trained on three real datasets. Moreover, we empirically demonstrate the disparate vulnerability of model inversion attacks, i.e., specific groups in the training dataset (grouped by gender, race, etc.) could be more vulnerable to model inversion attacks.
    An Application of Pseudo-Log-Likelihoods to Natural Language Scoring. (arXiv:2201.09377v1 [cs.CL])
    (0 min) Language models built using semi-supervised machine learning on large corpora of natural language have very quickly enveloped the fields of natural language generation and understanding. In this paper we apply a zero-shot approach independently developed by a number of researchers now gaining recognition as a significant alternative to fine-tuning for evaluation on common sense tasks. A language model with relatively few parameters and training steps compared to a more recent language model (T5) can outperform it on a recent large data set (TimeDial), while displaying robustness in its performance across a similar class of language tasks. Surprisingly, this result is achieved by using a hyperparameter-free zero-shot method with the smaller model, compared to fine-tuning to the larger model. We argue that robustness of the smaller model ought to be understood in terms of compositionality, in a sense that we draw from recent literature on a class of similar models. We identify a practical cost for our method and model: high GPU-time for natural language evaluation. The zero-shot measurement technique that produces remarkable stability, both for ALBERT and other BERT variants, is an application of pseudo-log-likelihoods to masked language models for the relative measurement of probability for substitution alternatives in forced choice language tasks such as the Winograd Schema Challenge, Winogrande, and others. One contribution of this paper is to bring together a number of similar, but independent strands of research. We produce some absolute state-of-the-art results for common sense reasoning in binary choice tasks, performing better than any published result in the literature, including fine-tuned efforts. We show a remarkable consistency of the model's performance under adversarial settings, which we argue is best explained by the model's compositionality of representations.
    Learning to Minimize the Remainder in Supervised Learning. (arXiv:2201.09193v1 [cs.CV])
    (0 min) The learning process of deep learning methods usually updates the model's parameters in multiple iterations. Each iteration can be viewed as the first-order approximation of Taylor's series expansion. The remainder, which consists of higher-order terms, is usually ignored in the learning process for simplicity. This learning scheme empowers various multimedia based applications, such as image retrieval, recommendation system, and video search. Generally, multimedia data (e.g., images) are semantics-rich and high-dimensional, hence the remainders of approximations are possibly non-zero. In this work, we consider the remainder to be informative and study how it affects the learning process. To this end, we propose a new learning approach, namely gradient adjustment learning (GAL), to leverage the knowledge learned from the past training iterations to adjust vanilla gradients, such that the remainders are minimized and the approximations are improved. The proposed GAL is model- and optimizer-agnostic, and is easy to adapt to the standard learning framework. It is evaluated on three tasks, i.e., image classification, object detection, and regression, with state-of-the-art models and optimizers. The experiments show that the proposed GAL consistently enhances the evaluated models, whereas the ablation studies validate various aspects of the proposed GAL. The code is available at \url{https://github.com/luoyan407/gradient_adjustment.git}.
    Communication-Efficient Stochastic Zeroth-Order Optimization for Federated Learning. (arXiv:2201.09531v1 [cs.LG])
    (0 min) Federated learning (FL), as an emerging edge artificial intelligence paradigm, enables many edge devices to collaboratively train a global model without sharing their private data. To enhance the training efficiency of FL, various algorithms have been proposed, ranging from first-order to second-order methods. However, these algorithms cannot be applied in scenarios where the gradient information is not available, e.g., federated black-box attack and federated hyperparameter tuning. To address this issue, in this paper we propose a derivative-free federated zeroth-order optimization (FedZO) algorithm featured by performing multiple local updates based on stochastic gradient estimators in each communication round and enabling partial device participation. Under the non-convex setting, we derive the convergence performance of the FedZO algorithm and characterize the impact of the numbers of local iterates and participating edge devices on the convergence. To enable communication-efficient FedZO over wireless networks, we further propose an over-the-air computation (AirComp) assisted FedZO algorithm. With an appropriate transceiver design, we show that the convergence of AirComp-assisted FedZO can still be preserved under certain signal-to-noise ratio conditions. Simulation results demonstrate the effectiveness of the FedZO algorithm and validate the theoretical observations.
    Fast Transient Stability Prediction Using Grid-informed Temporal and Topological Embedding Deep Neural Network. (arXiv:2201.09245v1 [eess.SY])
    (0 min) Transient stability prediction is critically essential to the fast online assessment and maintaining the stable operation in power systems. The wide deployment of phasor measurement units (PMUs) promotes the development of data-driven approaches for transient stability assessment. This paper proposes the temporal and topological embedding deep neural network (TTEDNN) model to forecast transient stability with the early transient dynamics. The TTEDNN model can accurately and efficiently predict the transient stability by extracting the temporal and topological features from the time-series data of the early transient dynamics. The grid-informed adjacency matrix is used to incorporate the power grid structural and electrical parameter information. The transient dynamics simulation environments under the single-node and multiple-node perturbations are used to test the performance of the TTEDNN model for the IEEE 39-bus and IEEE 118-bus power systems. The results show that the TTEDNN model has the best and most robust prediction performance. Furthermore, the TTEDNN model also demonstrates the transfer capability to predict the transient stability in the more complicated transient dynamics simulation environments.
    Optimal transport for causal discovery. (arXiv:2201.09366v1 [cs.LG])
    (0 min) Approaches based on Functional Causal Models (FCMs) have been proposed to determine causal direction between two variables, by properly restricting model classes; however, their performance is sensitive to the model assumptions, which makes it difficult for practitioners to use. In this paper, we provide a novel dynamical-system view of FCMs and propose a new framework for identifying causal direction in the bivariate case. We first show the connection between FCMs and optimal transport, and then study optimal transport under the constraints of FCMs. Furthermore, by exploiting the dynamical interpretation of optimal transport under the FCM constraints, we determine the corresponding underlying dynamical process of the static cause-effect pair data under the least action principle. It provides a new dimension for describing static causal discovery tasks, while enjoying more freedom for modeling the quantitative causal influences. In particular, we show that Additive Noise Models (ANMs) correspond to volume-preserving pressureless flows. Consequently, based on their velocity field divergence, we introduce a criterion to determine causal direction. With this criterion, we propose a novel optimal transport-based algorithm for ANMs which is robust to the choice of models and extend it to post-noninear models. Our method demonstrated state-of-the-art results on both synthetic and causal discovery benchmark datasets.
    Predicting Physics in Mesh-reduced Space with Temporal Attention. (arXiv:2201.09113v1 [cs.LG])
    (0 min) Graph-based next-step prediction models have recently been very successful in modeling complex high-dimensional physical systems on irregular meshes. However, due to their short temporal attention span, these models suffer from error accumulation and drift. In this paper, we propose a new method that captures long-term dependencies through a transformer-style temporal attention model. We introduce an encoder-decoder structure to summarize features and create a compact mesh representation of the system state, to allow the temporal model to operate on a low-dimensional mesh representations in a memory efficient manner. Our method outperforms a competitive GNN baseline on several complex fluid dynamics prediction tasks, from sonic shocks to vascular flow. We demonstrate stable rollouts without the need for training noise and show perfectly phase-stable predictions even for very long sequences. More broadly, we believe our approach paves the way to bringing the benefits of attention-based sequence models to solving high-dimensional complex physics tasks.
    One-Shot Learning on Attributed Sequences. (arXiv:2201.09202v1 [cs.LG])
    (0 min) One-shot learning has become an important research topic in the last decade with many real-world applications. The goal of one-shot learning is to classify unlabeled instances when there is only one labeled example per class. Conventional problem setting of one-shot learning mainly focuses on the data that is already in feature space (such as images). However, the data instances in real-world applications are often more complex and feature vectors may not be available. In this paper, we study the problem of one-shot learning on attributed sequences, where each instance is composed of a set of attributes (e.g., user profile) and a sequence of categorical items (e.g., clickstream). This problem is important for a variety of real-world applications ranging from fraud prevention to network intrusion detection. This problem is more challenging than conventional one-shot learning since there are dependencies between attributes and sequences. We design a deep learning framework OLAS to tackle this problem. The proposed OLAS utilizes a twin network to generalize the features from pairwise attributed sequence examples. Empirical results on real-world datasets demonstrate the proposed OLAS can outperform the state-of-the-art methods under a rich variety of parameter settings.
    Bi-CLKT: Bi-Graph Contrastive Learning based Knowledge Tracing. (arXiv:2201.09020v1 [cs.LG])
    (0 min) The goal of Knowledge Tracing (KT) is to estimate how well students have mastered a concept based on their historical learning of related exercises. The benefit of knowledge tracing is that students' learning plans can be better organised and adjusted, and interventions can be made when necessary. With the recent rise of deep learning, Deep Knowledge Tracing (DKT) has utilised Recurrent Neural Networks (RNNs) to accomplish this task with some success. Other works have attempted to introduce Graph Neural Networks (GNNs) and redefine the task accordingly to achieve significant improvements. However, these efforts suffer from at least one of the following drawbacks: 1) they pay too much attention to details of the nodes rather than to high-level semantic information; 2) they struggle to effectively establish spatial associations and complex structures of the nodes; and 3) they represent either concepts or exercises only, without integrating them. Inspired by recent advances in self-supervised learning, we propose a Bi-Graph Contrastive Learning based Knowledge Tracing (Bi-CLKT) to address these limitations. Specifically, we design a two-layer contrastive learning scheme based on an "exercise-to-exercise" (E2E) relational subgraph. It involves node-level contrastive learning of subgraphs to obtain discriminative representations of exercises, and graph-level contrastive learning to obtain discriminative representations of concepts. Moreover, we designed a joint contrastive loss to obtain better representations and hence better prediction performance. Also, we explored two different variants, using RNN and memory-augmented neural networks as the prediction layer for comparison to obtain better representations of exercises and concepts respectively. Extensive experiments on four real-world datasets show that the proposed Bi-CLKT and its variants outperform other baseline models.
    Good Classification Measures and How to Find Them. (arXiv:2201.09044v1 [cs.LG])
    (0 min) Several performance measures can be used for evaluating classification results: accuracy, F-measure, and many others. Can we say that some of them are better than others, or, ideally, choose one measure that is best in all situations? To answer this question, we conduct a systematic analysis of classification performance measures: we formally define a list of desirable properties and theoretically analyze which measures satisfy which properties. We also prove an impossibility theorem: some desirable properties cannot be simultaneously satisfied. Finally, we propose a new family of measures satisfying all desirable properties except one. This family includes the Matthews Correlation Coefficient and a so-called Symmetric Balanced Accuracy that was not previously used in classification literature. We believe that our systematic approach gives an important tool to practitioners for adequately evaluating classification results.
    Dynamic Channel Access via Meta-Reinforcement Learning. (arXiv:2201.09075v1 [cs.NI])
    (0 min) In this paper, we address the channel access problem in a dynamic wireless environment via meta-reinforcement learning. Spectrum is a scarce resource in wireless communications, especially with the dramatic increase in the number of devices in networks. Recently, inspired by the success of deep reinforcement learning (DRL), extensive studies have been conducted in addressing wireless resource allocation problems via DRL. However, training DRL algorithms usually requires a massive amount of data collected from the environment for each specific task and the well-trained model may fail if there is a small variation in the environment. In this work, in order to address these challenges, we propose a meta-DRL framework that incorporates the method of Model-Agnostic Meta-Learning (MAML). In the proposed framework, we train a common initialization for similar channel selection tasks. From the initialization, we show that only a few gradient descents are required for adapting to different tasks drawn from the same distribution. We demonstrate the performance improvements via simulation results.
    FN-Net:Remove the Outliers by Filtering the Noise. (arXiv:2201.09213v1 [cs.CV])
    (0 min) Establishing the correspondence between two images is an important research direction of computer vision. When estimating the relationship between two images, it is often disturbed by outliers. In this paper, we propose a convolutional neural network that can filter the noise of outliers. It can output the probability that the pair of feature points is an inlier and regress the essential matrix representing the relative pose of the camera. The outliers are mainly caused by the noise introduced by the previous processing. The outliers rejection can be treated as a problem of noise elimination, and the soft threshold function has a very good effect on noise reduction. Therefore, we designed an adaptive denoising module based on soft threshold function to remove noise components in the outliers, to reduce the probability that the outlier is predicted to be an inlier. Experimental results on the YFCC100M dataset show that our method exceeds the state-of-the-art in relative pose estimation.
    On the Robustness of Counterfactual Explanations to Adverse Perturbations. (arXiv:2201.09051v1 [cs.LG])
    (0 min) Counterfactual explanations (CEs) are a powerful means for understanding how decisions made by algorithms can be changed. Researchers have proposed a number of desiderata that CEs should meet to be practically useful, such as requiring minimal effort to enact, or complying with causal models. We consider a further aspect to improve the usability of CEs: robustness to adverse perturbations, which may naturally happen due to unfortunate circumstances. Since CEs typically prescribe a sparse form of intervention (i.e., only a subset of the features should be changed), we provide two definitions of robustness, which concern, respectively, the features to change and to keep as they are. These definitions are workable in that they can be incorporated as penalty terms in the loss functions that are used for discovering CEs. To experiment with the proposed definitions of robustness, we create and release code where five data sets (commonly used in the field of fair and explainable machine learning) have been enriched with feature-specific annotations that can be used to sample meaningful perturbations. Our experiments show that CEs are often not robust and, if adverse perturbations take place, the intervention they prescribe may require a much larger cost than anticipated, or even become impossible. However, accounting for robustness in the search process, which can be done rather easily, allows discovering robust CEs systematically. Robust CEs are resilient to adverse perturbations: additional intervention to contrast perturbations is much less costly than for non-robust CEs. Our code is available at: https://github.com/marcovirgolin/robust-counterfactuals
    Distributed Bandits with Heterogeneous Agents. (arXiv:2201.09353v1 [cs.LG])
    (0 min) This paper tackles a multi-agent bandit setting where $M$ agents cooperate together to solve the same instance of a $K$-armed stochastic bandit problem. The agents are \textit{heterogeneous}: each agent has limited access to a local subset of arms and the agents are asynchronous with different gaps between decision-making rounds. The goal for each agent is to find its optimal local arm, and agents can cooperate by sharing their observations with others. While cooperation between agents improves the performance of learning, it comes with an additional complexity of communication between agents. For this heterogeneous multi-agent setting, we propose two learning algorithms, \ucbo and \AAE. We prove that both algorithms achieve order-optimal regret, which is $O\left(\sum_{i:\tilde{\Delta}_i>0} \log T/\tilde{\Delta}_i\right)$, where $\tilde{\Delta}_i$ is the minimum suboptimality gap between the reward mean of arm $i$ and any local optimal arm. In addition, a careful selection of the valuable information for cooperation, \AAE achieves a low communication complexity of $O(\log T)$. Last, numerical experiments verify the efficiency of both algorithms.
    Wisdom of the Contexts: Active Ensemble Learning for Contextual Anomaly Detection. (arXiv:2101.11560v3 [cs.LG] UPDATED)
    (0 min) In contextual anomaly detection, an object is only considered anomalous within a specific context. Most existing methods for CAD use a single context based on a set of user-specified contextual features. However, identifying the right context can be very challenging in practice, especially in datasets, with a large number of attributes. Furthermore, in real-world systems, there might be multiple anomalies that occur in different contexts and, therefore, require a combination of several "useful" contexts to unveil them. In this work, we leverage active learning and ensembles to effectively detect complex contextual anomalies in situations where the true contextual and behavioral attributes are unknown. We propose a novel approach, called WisCon (Wisdom of the Contexts), that automatically creates contexts from the feature set. Our method constructs an ensemble of multiple contexts, with varying importance scores, based on the assumption that not all useful contexts are equally so. Experiments show that WisCon significantly outperforms existing baselines in different categories (i.e., active classifiers, unsupervised contextual and non-contextual anomaly detectors, and supervised classifiers) on seven datasets. Furthermore, the results support our initial hypothesis that there is no single perfect context that successfully uncovers all kinds of contextual anomalies, and leveraging the "wisdom" of multiple contexts is necessary.
    A Machine Learning Framework for Distributed Functional Compression over Wireless Channels in IoT. (arXiv:2201.09483v1 [cs.LG])
    (0 min) IoT devices generating enormous data and state-of-the-art machine learning techniques together will revolutionize cyber-physical systems. In many diverse fields, from autonomous driving to augmented reality, distributed IoT devices compute specific target functions without simple forms like obstacle detection, object recognition, etc. Traditional cloud-based methods that focus on transferring data to a central location either for training or inference place enormous strain on network resources. To address this, we develop, to the best of our knowledge, the first machine learning framework for distributed functional compression over both the Gaussian Multiple Access Channel (GMAC) and orthogonal AWGN channels. Due to the Kolmogorov-Arnold representation theorem, our machine learning framework can, by design, compute any arbitrary function for the desired functional compression task in IoT. Importantly the raw sensory data are never transferred to a central node for training or inference, thus reducing communication. For these algorithms, we provide theoretical convergence guarantees and upper bounds on communication. Our simulations show that the learned encoders and decoders for functional compression perform significantly better than traditional approaches, are robust to channel condition changes and sensor outages. Compared to the cloud-based scenario, our algorithms reduce channel use by two orders of magnitude.
    Out of Distribution Detection on ImageNet-O. (arXiv:2201.09352v1 [cs.CV])
    (0 min) Out of distribution (OOD) detection is a crucial part of making machine learning systems robust. The ImageNet-O dataset is an important tool in testing the robustness of ImageNet trained deep neural networks that are widely used across a variety of systems and applications. We aim to perform a comparative analysis of OOD detection methods on ImageNet-O, a first of its kind dataset with a label distribution different than that of ImageNet, that has been created to aid research in OOD detection for ImageNet models. As this dataset is fairly new, we aim to provide a comprehensive benchmarking of some of the current state of the art OOD detection methods on this novel dataset. This benchmarking covers a variety of model architectures, settings where we haves prior access to the OOD data versus when we don't, predictive score based approaches, deep generative approaches to OOD detection, and more.
    Topological data analysis and clustering. (arXiv:2201.09054v1 [math.AT])
    (0 min) Clustering is one of the most common tasks of Machine Learning. In this paper we examine how ideas from topology can be used to improve clustering techniques.
    Active Learning Polynomial Threshold Functions. (arXiv:2201.09433v1 [cs.LG])
    (0 min) We initiate the study of active learning polynomial threshold functions (PTFs). While traditional lower bounds imply that even univariate quadratics cannot be non-trivially actively learned, we show that allowing the learner basic access to the derivatives of the underlying classifier circumvents this issue and leads to a computationally efficient algorithm for active learning degree-$d$ univariate PTFs in $\tilde{O}(d^3\log(1/\varepsilon\delta))$ queries. We also provide near-optimal algorithms and analyses for active learning PTFs in several average case settings. Finally, we prove that access to derivatives is insufficient for active learning multivariate PTFs, even those of just two variables.
    MIDAS: Deep learning human action intention prediction from natural eye movement patterns. (arXiv:2201.09135v1 [cs.CV])
    (0 min) Eye movements have long been studied as a window into the attentional mechanisms of the human brain and made accessible as novelty style human-machine interfaces. However, not everything that we gaze upon, is something we want to interact with; this is known as the Midas Touch problem for gaze interfaces. To overcome the Midas Touch problem, present interfaces tend not to rely on natural gaze cues, but rather use dwell time or gaze gestures. Here we present an entirely data-driven approach to decode human intention for object manipulation tasks based solely on natural gaze cues. We run data collection experiments where 16 participants are given manipulation and inspection tasks to be performed on various objects on a table in front of them. The subjects' eye movements are recorded using wearable eye-trackers allowing the participants to freely move their head and gaze upon the scene. We use our Semantic Fovea, a convolutional neural network model to obtain the objects in the scene and their relation to gaze traces at every frame. We then evaluate the data and examine several ways to model the classification task for intention prediction. Our evaluation shows that intention prediction is not a naive result of the data, but rather relies on non-linear temporal processing of gaze cues. We model the task as a time series classification problem and design a bidirectional Long-Short-Term-Memory (LSTM) network architecture to decode intentions. Our results show that we can decode human intention of motion purely from natural gaze cues and object relative position, with $91.9\%$ accuracy. Our work demonstrates the feasibility of natural gaze as a Zero-UI interface for human-machine interaction, i.e., users will only need to act naturally, and do not need to interact with the interface itself or deviate from their natural eye movement patterns.
    The Many Faces of Adversarial Risk. (arXiv:2201.08956v1 [stat.ML])
    (0 min) Adversarial risk quantifies the performance of classifiers on adversarially perturbed data. Numerous definitions of adversarial risk -- not all mathematically rigorous and differing subtly in the details -- have appeared in the literature. In this paper, we revisit these definitions, make them rigorous, and critically examine their similarities and differences. Our technical tools derive from optimal transport, robust statistics, functional analysis, and game theory. Our contributions include the following: generalizing Strassen's theorem to the unbalanced optimal transport setting with applications to adversarial classification with unequal priors; showing an equivalence between adversarial robustness and robust hypothesis testing with $\infty$-Wasserstein uncertainty sets; proving the existence of a pure Nash equilibrium in the two-player game between the adversary and the algorithm; and characterizing adversarial risk by the minimum Bayes error between a pair of distributions belonging to the $\infty$-Wasserstein uncertainty sets. Our results generalize and deepen recently discovered connections between optimal transport and adversarial robustness and reveal new connections to Choquet capacities and game theory.
    Artificial Intelligence for Suicide Assessment using Audiovisual Cues: A Review. (arXiv:2201.09130v1 [cs.AI])
    (0 min) Death by suicide is the seventh of the leading death cause worldwide. The recent advancement in Artificial Intelligence (AI), specifically AI application in image and voice processing, has created a promising opportunity to revolutionize suicide risk assessment. Subsequently, we have witnessed fast-growing literature of researches that applies AI to extract audiovisual non-verbal cues for mental illness assessment. However, the majority of the recent works focus on depression, despite the evident difference between depression signs and suicidal behavior non-verbal cues. In this paper, we review the recent works that study suicide ideation and suicide behavior detection through audiovisual feature analysis, mainly suicidal voice/speech acoustic features analysis and suicidal visual cues.
    glassoformer: a query-sparse transformer for post-fault power grid voltage prediction. (arXiv:2201.09145v1 [cs.LG])
    (0 min) We propose GLassoformer, a novel and efficient transformer architecture leveraging group Lasso regularization to reduce the number of queries of the standard self-attention mechanism. Due to the sparsified queries, GLassoformer is more computationally efficient than the standard transformers. On the power grid post-fault voltage prediction task, GLassoformer shows remarkably better prediction than many existing benchmark algorithms in terms of accuracy and stability.
    PiCO: Contrastive Label Disambiguation for Partial Label Learning. (arXiv:2201.08984v1 [cs.LG])
    (0 min) Partial label learning (PLL) is an important problem that allows each training example to be labeled with a coarse candidate set, which well suits many real-world data annotation scenarios with label ambiguity. Despite the promise, the performance of PLL often lags behind the supervised counterpart. In this work, we bridge the gap by addressing two key research challenges in PLL -- representation learning and label disambiguation -- in one coherent framework. Specifically, our proposed framework PiCO consists of a contrastive learning module along with a novel class prototype-based label disambiguation algorithm. PiCO produces closely aligned representations for examples from the same classes and facilitates label disambiguation. Theoretically, we show that these two components are mutually beneficial, and can be rigorously justified from an expectation-maximization (EM) algorithm perspective. Extensive experiments demonstrate that PiCO significantly outperforms the current state-of-the-art approaches in PLL and even achieves comparable results to fully supervised learning. Code and data available: https://github.com/hbzju/PiCO.
    NAS-VAD: Neural Architecture Search for Voice Activity Detection. (arXiv:2201.09032v1 [cs.SD])
    (0 min) The need for automatic design of deep neural networks has led to the emergence of neural architecture search (NAS), which has generated models outperforming manually-designed models. However, most existing NAS frameworks are designed for image processing tasks, and lack structures and operations effective for voice activity detection (VAD) tasks. To discover improved VAD models through automatic design, we present the first work that proposes a NAS framework optimized for the VAD task. The proposed NAS-VAD framework expands the existing search space with the attention mechanism while incorporating the compact macro-architecture with fewer cells. The experimental results show that the models discovered by NAS-VAD outperform the existing manually-designed VAD models in various synthetic and real-world datasets. Our code and models are available at https://github.com/daniel03c1/NAS_VAD.
    Concurrent Adversarial Learning for Large-Batch Training. (arXiv:2106.00221v2 [cs.LG] UPDATED)
    (0 min) Large-batch training has become a commonly used technique when training neural networks with a large number of GPU/TPU processors. As batch size increases, stochastic optimizers tend to converge to sharp local minima, leading to degraded test performance. Current methods usually use extensive data augmentation to increase the batch size, but we found the performance gain with data augmentation decreases as batch size increases, and data augmentation will become insufficient after certain point. In this paper, we propose to use adversarial learning to increase the batch size in large-batch training. Despite being a natural choice for smoothing the decision surface and biasing towards a flat region, adversarial learning has not been successfully applied in large-batch training since it requires at least two sequential gradient computations at each step, which will at least double the running time compared with vanilla training even with a large number of processors. To overcome this issue, we propose a novel Concurrent Adversarial Learning (ConAdv) method that decouple the sequential gradient computations in adversarial learning by utilizing staled parameters. Experimental results demonstrate that ConAdv can successfully increase the batch size on ResNet-50 training on ImageNet while maintaining high accuracy. In particular, we show ConAdv along can achieve 75.3\% top-1 accuracy on ImageNet ResNet-50 training with 96K batch size, and the accuracy can be further improved to 76.2\% when combining ConAdv with data augmentation. This is the first work successfully scales ResNet-50 training batch size to 96K.
    Model-free Reinforcement Learning for Robust Locomotion using Demonstrations from Trajectory Optimization. (arXiv:2107.06629v2 [cs.RO] UPDATED)
    (0 min) We present a general, two-stage reinforcement learning approach to create robust policies that can be deployed on real robots without any additional training using a single demonstration generated by trajectory optimization. The demonstration is used in the first stage as a starting point to facilitate initial exploration. In the second stage, the relevant task reward is optimized directly and a policy robust to environment uncertainties is computed. We demonstrate and examine in detail the performance and robustness of our approach on highly dynamic hopping and bounding tasks on a quadruped robot.
    EGGS: Eigen-Gap Guided Search For Automated Spectral Clustering. (arXiv:2107.12183v3 [cs.LG] UPDATED)
    (0 min) The performance of spectral clustering heavily relies on the quality of affinity matrix. A variety of affinity-matrix-construction (AMC) methods have been proposed but they have hyperparameters to determine beforehand, which requires strong experience and lead to difficulty in real applications especially when the inter-cluster similarity is high or/and the dataset is large. In addition, we often need to choose different AMC methods for different datasets, which still depends on experience. To solve these two challenging problems, in this paper, we present a simple yet effective method for automated spectral clustering. The main idea is to find the most reliable affinity matrix among a set of candidates given by different AMC methods with different hyperparameters, where the reliability is quantified by the \textit{relative-eigen-gap} of graph Laplacian introduced in this paper. We also implement the method using Bayesian optimization.We extend the method to large-scale datasets such as MNIST, on which the time cost is less than 90s and the clustering accuracy is state-of-the-art. Extensive experiments of natural image clustering show that our method is more versatile, accurate, and efficient than baseline methods.
    Out-of-Distribution Detection Using Outlier Detection Methods. (arXiv:2108.08218v2 [cs.LG] UPDATED)
    (0 min) Out-of-distribution detection (OOD) deals with anomalous input to neural networks. In the past, specialized methods have been proposed to reject predictions on anomalous input. Similarly, it was shown that feature extraction models in combination with outlier detection algorithms are well suited to detect anomalous input. We use outlier detection algorithms to detect anomalous input as reliable as specialized methods from the field of OOD. No neural network adaptation is required; detection is based on the model's softmax score. Our approach works unsupervised using an Isolation Forest and can be further improved by using a supervised learning method such as Gradient Boosting.
    The Supermarket Model with Known and Predicted Service Times. (arXiv:1905.12155v3 [cs.PF] UPDATED)
    (0 min) The supermarket model refers to a system with a large number of queues, where new customers choose d queues at random and join the one with the fewest customers. This model demonstrates the power of even small amounts of choice, as compared to simply joining a queue chosen uniformly at random, for load balancing systems. In this work we perform simulation-based studies to consider variations where service times for a customer are predicted, as might be done in modern settings using machine learning techniques or related mechanisms. Our primary takeaway is that using even seemingly weak predictions of service times can yield significant benefits over blind First In First Out queueing in this context. However, some care must be taken when using predicted service time information to both choose a queue and order elements for service within a queue; while in many cases using the information for both choosing and ordering is beneficial, in many of our simulation settings we find that simply using the number of jobs to choose a queue is better when using predicted service times to order jobs in a queue. In our simulations, we evaluate both synthetic and real-world workloads--in the latter, service times are predicted by machine learning. Our results provide practical guidance for the design of real-world systems; moreover, we leave many natural theoretical open questions for future work, validating their relevance to real-world situations.
    Efficient and Robust Classification for Sparse Attacks. (arXiv:2201.09369v1 [cs.LG])
    (0 min) In the past two decades we have seen the popularity of neural networks increase in conjunction with their classification accuracy. Parallel to this, we have also witnessed how fragile the very same prediction models are: tiny perturbations to the inputs can cause misclassification errors throughout entire datasets. In this paper, we consider perturbations bounded by the $\ell_0$--norm, which have been shown as effective attacks in the domains of image-recognition, natural language processing, and malware-detection. To this end, we propose a novel defense method that consists of "truncation" and "adversarial training". We then theoretically study the Gaussian mixture setting and prove the asymptotic optimality of our proposed classifier. Motivated by the insights we obtain, we extend these components to neural network classifiers. We conduct numerical experiments in the domain of computer vision using the MNIST and CIFAR datasets, demonstrating significant improvement for the robust classification error of neural networks.
    Probability Distribution on Rooted Trees. (arXiv:2201.09460v1 [cs.LG])
    (0 min) The hierarchical and recursive expressive capability of rooted trees is applicable to represent statistical models in various areas, such as data compression, image processing, and machine learning. On the other hand, such hierarchical expressive capability causes a problem in tree selection to avoid overfitting. One unified approach to solve this is a Bayesian approach, on which the rooted tree is regarded as a random variable and a direct loss function can be assumed on the selected model or the predicted value for a new data point. However, all the previous studies on this approach are based on the probability distribution on full trees, to the best of our knowledge. In this paper, we propose a generalized probability distribution for any rooted trees in which only the maximum number of child nodes and the maximum depth are fixed. Furthermore, we derive recursive methods to evaluate the characteristics of the probability distribution without any approximations.
    Accelerated Intravascular Ultrasound Imaging using Deep Reinforcement Learning. (arXiv:2201.09522v1 [eess.SP])
    (0 min) Intravascular ultrasound (IVUS) offers a unique perspective in the treatment of vascular diseases by creating a sequence of ultrasound-slices acquired from within the vessel. However, unlike conventional hand-held ultrasound, the thin catheter only provides room for a small number of physical channels for signal transfer from a transducer-array at the tip. For continued improvement of image quality and frame rate, we present the use of deep reinforcement learning to deal with the current physical information bottleneck. Valuable inspiration has come from the field of magnetic resonance imaging (MRI), where learned acquisition schemes have brought significant acceleration in image acquisition at competing image quality. To efficiently accelerate IVUS imaging, we propose a framework that utilizes deep reinforcement learning for an optimal adaptive acquisition policy on a per-frame basis enabled by actor-critic methods and Gumbel top-$K$ sampling.
    Scalable Safe Exploration for Global Optimization of Dynamical Systems. (arXiv:2201.09562v1 [cs.LG])
    (0 min) Learning optimal control policies directly on physical systems is challenging since even a single failure can lead to costly hardware damage. Most existing learning methods that guarantee safety, i.e., no failures, during exploration are limited to local optima. A notable exception is the GoSafe algorithm, which, unfortunately, cannot handle high-dimensional systems and hence cannot be applied to most real-world dynamical systems. This work proposes GoSafeOpt as the first algorithm that can safely discover globally optimal policies for complex systems while giving safety and optimality guarantees. Our experiments on a robot arm that would be prohibitive for GoSafe demonstrate that GoSafeOpt safely finds remarkably better policies than competing safe learning methods for high-dimensional domains.
    Effective Learning of a GMRF Mixture Model. (arXiv:2005.09030v3 [cs.LG] UPDATED)
    (0 min) Learning a Gaussian Mixture Model (GMM) is hard when the number of parameters is too large given the amount of available data. As a remedy, we propose restricting the GMM to a Gaussian Markov Random Field Mixture Model (GMRF-MM), as well as a new method for estimating the latter's sparse precision (i.e., inverse covariance) matrices. When the sparsity pattern of each matrix is known, we propose an efficient optimization method for the Maximum Likelihood Estimate (MLE) of that matrix. When it is unknown, we utilize the popular Graphical Least Absolute Shrinkage and Selection Operator (GLASSO) to estimate that pattern. However, we show that even for a single Gaussian, when GLASSO is tuned to successfully estimate the sparsity pattern, it does so at the price of a substantial bias of the values of the nonzero entries of the matrix, and we show that this problem only worsens in a mixture setting. To overcome this, we discard the nonzero values estimated by GLASSO, keep only its pattern estimate and use it within the proposed MLE method. This yields an effective two-step procedure that removes the bias. We show that our "debiasing" approach outperforms GLASSO in both the single-GMRF and the GMRF-MM cases. We also show that when learning priors for image patches, our method outperforms GLASSO even if we merely use an educated guess about the sparsity pattern, and that our GMRF-MM outperforms the baseline GMM on real and synthetic high-dimensional datasets.
    Deep Localization of Mixed Image Tampering Techniques. (arXiv:1904.08484v3 [cs.CV] UPDATED)
    (0 min) With technological advances leading to an increase in mechanisms for image tampering, fraud detection methods must continue to be upgraded to match their sophistication. One problem with current methods is that they require prior knowledge of the method of forgery in order to determine which features to extract from the image to localize the region of interest. When a machine learning algorithm is used to learn different types of tampering from a large set of various image types, with a large enough database we can easily classify which images are tampered. However, we still are left with the question of which features to train on, and how to localize the manipulation. In this work, deep learning for object detection is adapted to tampering detection to solve these two problems, while fusing features from multiple classic techniques for improved accuracy. A Multi-stream version of the Faster RCNN network will be employed with the second stream having an input of the element-wise sum of the ELA and BAG error maps to provide even higher accuracy than a single stream alone.
    Communication-efficient k-Means for Edge-based Machine Learning. (arXiv:2102.04282v2 [cs.LG] UPDATED)
    (0 min) We consider the problem of computing the k-means centers for a large high-dimensional dataset in the context of edge-based machine learning, where data sources offload machine learning computation to nearby edge servers. k-Means computation is fundamental to many data analytics, and the capability of computing provably accurate k-means centers by leveraging the computation power of the edge servers, at a low communication and computation cost to the data sources, will greatly improve the performance of these analytics. We propose to let the data sources send small summaries, generated by joint dimensionality reduction (DR), cardinality reduction (CR), and quantization (QT), to support approximate k-means computation at reduced complexity and communication cost. By analyzing the complexity, the communication cost, and the approximation error of k-means algorithms based on carefully designed composition of DR/CR/QT methods, we show that: (i) it is possible to compute near-optimal k-means centers at a near-linear complexity and a constant or logarithmic communication cost, (ii) the order of applying DR and CR significantly affects the complexity and the communication cost, and (iii) combining DR/CR methods with a properly configured quantizer can further reduce the communication cost without compromising the other performance metrics. Our theoretical analysis has been validated through experiments based on real datasets.
    Learning to Predict Gradients for Semi-Supervised Continual Learning. (arXiv:2201.09196v1 [cs.LG])
    (0 min) A key challenge for machine intelligence is to learn new visual concepts without forgetting the previously acquired knowledge. Continual learning is aimed towards addressing this challenge. However, there is a gap between existing supervised continual learning and human-like intelligence, where human is able to learn from both labeled and unlabeled data. How unlabeled data affects learning and catastrophic forgetting in the continual learning process remains unknown. To explore these issues, we formulate a new semi-supervised continual learning method, which can be generically applied to existing continual learning models. Specifically, a novel gradient learner learns from labeled data to predict gradients on unlabeled data. Hence, the unlabeled data could fit into the supervised continual learning method. Different from conventional semi-supervised settings, we do not hypothesize that the underlying classes, which are associated to the unlabeled data, are known to the learning process. In other words, the unlabeled data could be very distinct from the labeled data. We evaluate the proposed method on mainstream continual learning, adversarial continual learning, and semi-supervised learning tasks. The proposed method achieves state-of-the-art performance on classification accuracy and backward transfer in the continual learning setting while achieving desired performance on classification accuracy in the semi-supervised learning setting. This implies that the unlabeled images can enhance the generalizability of continual learning models on the predictive ability on unseen data and significantly alleviate catastrophic forgetting. The code is available at \url{https://github.com/luoyan407/grad_prediction.git}.
    Efficient Algorithms for Finite Horizon and Streaming Restless Multi-Armed Bandit Problems. (arXiv:2103.04730v2 [cs.LG] UPDATED)
    (0 min) We propose Streaming Bandits, a Restless Multi Armed Bandit (RMAB) framework in which heterogeneous arms may arrive and leave the system after staying on for a finite lifetime. Streaming Bandits naturally capture the health intervention planning problem, where health workers must manage the health outcomes of a patient cohort while new patients join and existing patients leave the cohort each day. Our contributions are as follows: (1) We derive conditions under which our problem satisfies indexability, a precondition that guarantees the existence and asymptotic optimality of the Whittle Index solution for RMABs. We establish the conditions using a polytime reduction of the Streaming Bandit setup to regular RMABs. (2) We further prove a phenomenon that we call index decay, whereby the Whittle index values are low for short residual lifetimes driving the intuition underpinning our algorithm. (3) We propose a novel and efficient algorithm to compute the index-based solution for Streaming Bandits. Unlike previous methods, our algorithm does not rely on solving the costly finite horizon problem on each arm of the RMAB, thereby lowering the computational complexity compared to existing methods. (4) Finally, we evaluate our approach via simulations run on realworld data sets from a tuberculosis patient monitoring task and an intervention planning task for improving maternal healthcare, in addition to other synthetic domains. Across the board, our algorithm achieves a 2-orders-of-magnitude speed-up over existing methods while maintaining the same solution quality.
    Hardware/Software Co-Programmable Framework for Computational SSDs to Accelerate Deep Learning Service on Large-Scale Graphs. (arXiv:2201.09189v1 [cs.AR])
    (0 min) Graph neural networks (GNNs) process large-scale graphs consisting of a hundred billion edges. In contrast to traditional deep learning, unique behaviors of the emerging GNNs are engaged with a large set of graphs and embedding data on storage, which exhibits complex and irregular preprocessing. We propose a novel deep learning framework on large graphs, HolisticGNN, that provides an easy-to-use, near-storage inference infrastructure for fast, energy-efficient GNN processing. To achieve the best end-to-end latency and high energy efficiency, HolisticGNN allows users to implement various GNN algorithms and directly executes them where the actual data exist in a holistic manner. It also enables RPC over PCIe such that the users can simply program GNNs through a graph semantic library without any knowledge of the underlying hardware or storage configurations. We fabricate HolisticGNN's hardware RTL and implement its software on an FPGA-based computational SSD (CSSD). Our empirical evaluations show that the inference time of HolisticGNN outperforms GNN inference services using high-performance modern GPUs by 7.1x while reducing energy consumption by 33.2x, on average.
    Perceptual cGAN for MRI Super-resolution. (arXiv:2201.09314v1 [eess.IV])
    (0 min) Capturing high-resolution magnetic resonance (MR) images is a time consuming process, which makes it unsuitable for medical emergencies and pediatric patients. Low-resolution MR imaging, by contrast, is faster than its high-resolution counterpart, but it compromises on fine details necessary for a more precise diagnosis. Super-resolution (SR), when applied to low-resolution MR images, can help increase their utility by synthetically generating high-resolution images with little additional time. In this paper, we present a SR technique for MR images that is based on generative adversarial networks (GANs), which have proven to be quite useful in generating sharp-looking details in SR. We introduce a conditional GAN with perceptual loss, which is conditioned upon the input low-resolution image, which improves the performance for isotropic and anisotropic MRI super-resolution.
    Explore the Expression: Facial Expression Generation using Auxiliary Classifier Generative Adversarial Network. (arXiv:2201.09061v1 [cs.CV])
    (0 min) Facial expressions are a form of non-verbal communication that humans perform seamlessly for meaningful transfer of information. Most of the literature addresses the facial expression recognition aspect however, with the advent of Generative Models, it has become possible to explore the affect space in addition to mere classification of a set of expressions. In this article, we propose a generative model architecture which robustly generates a set of facial expressions for multiple character identities and explores the possibilities of generating complex expressions by combining the simple ones.
    External Stability Auditing to Test the Validity of Personality Prediction in AI Hiring. (arXiv:2201.09151v1 [cs.CY])
    (0 min) Automated hiring systems are among the fastest-developing of all high-stakes AI systems. Among these are algorithmic personality tests that use insights from psychometric testing, and promise to surface personality traits indicative of future success based on job seekers' resumes or social media profiles. We interrogate the validity of such systems using stability of the outputs they produce, noting that reliability is a necessary, but not a sufficient, condition for validity. Our approach is to (a) develop a methodology for an external audit of stability of predictions made by algorithmic personality tests, and (b) instantiate this methodology in an audit of two systems, Humantic AI and Crystal. Crucially, rather than challenging or affirming the assumptions made in psychometric testing -- that personality is a meaningful and measurable construct, and that personality traits are indicative of future success on the job -- we frame our methodology around testing the underlying assumptions made by the vendors of the algorithmic personality tests themselves. In our audit of Humantic AI and Crystal, we find that both systems show substantial instability with respect to key facets of measurement, and so cannot be considered valid testing instruments. For example, Crystal frequently computes different personality scores if the same resume is given in PDF vs. in raw text format, violating the assumption that the output of an algorithmic personality test is stable across job-irrelevant variations in the input. Among other notable findings is evidence of persistent -- and often incorrect -- data linkage by Humantic AI.
    Dichotomic Pattern Mining with Applications to Intent Prediction from Semi-Structured Clickstream Datasets. (arXiv:2201.09178v1 [cs.AI])
    (0 min) We introduce a pattern mining framework that operates on semi-structured datasets and exploits the dichotomy between outcomes. Our approach takes advantage of constraint reasoning to find sequential patterns that occur frequently and exhibit desired properties. This allows the creation of novel pattern embeddings that are useful for knowledge extraction and predictive modeling. Finally, we present an application on customer intent prediction from digital clickstream data. Overall, we show that pattern embeddings play an integrator role between semi-structured data and machine learning models, improve the performance of the downstream task and retain interpretability.
    Signal Strength and Noise Drive Feature Preference in CNN Image Classifiers. (arXiv:2201.08893v1 [cs.CV])
    (0 min) Feature preference in Convolutional Neural Network (CNN) image classifiers is integral to their decision making process, and while the topic has been well studied, it is still not understood at a fundamental level. We test a range of task relevant feature attributes (including shape, texture, and color) with varying degrees of signal and noise in highly controlled CNN image classification experiments using synthetic datasets to determine feature preferences. We find that CNNs will prefer features with stronger signal strength and lower noise irrespective of whether the feature is texture, shape, or color. This provides guidance for a predictive model for task relevant feature preferences, demonstrates pathways for bias in machine models that can be avoided with careful controls on experimental setup, and suggests that comparisons between how humans and machines prefer task relevant features in vision classification tasks should be revisited. Code to reproduce experiments in this paper can be found at \url{https://github.com/mwolff31/signal_preference}.
    An Attention-based ConvLSTM Autoencoder with Dynamic Thresholding for Unsupervised Anomaly Detection in Multivariate Time Series. (arXiv:2201.09172v1 [cs.LG])
    (0 min) As a substantial amount of multivariate time series data is being produced by the complex systems in Smart Manufacturing, improved anomaly detection frameworks are needed to reduce the operational risks and the monitoring burden placed on the system operators. However, building such frameworks is challenging, as a sufficiently large amount of defective training data is often not available and frameworks are required to capture both the temporal and contextual dependencies across different time steps while being robust to noise. In this paper, we propose an unsupervised Attention-based Convolutional Long Short-Term Memory (ConvLSTM) Autoencoder with Dynamic Thresholding (ACLAE-DT) framework for anomaly detection and diagnosis in multivariate time series. The framework starts by pre-processing and enriching the data, before constructing feature images to characterize the system statuses across different time steps by capturing the inter-correlations between pairs of time series. Afterwards, the constructed feature images are fed into an attention-based ConvLSTM autoencoder, which aims to encode the constructed feature images and capture the temporal behavior, followed by decoding the compressed knowledge representation to reconstruct the feature images input. The reconstruction errors are then computed and subjected to a statistical-based, dynamic thresholding mechanism to detect and diagnose the anomalies. Evaluation results conducted on real-life manufacturing data demonstrate the performance strengths of the proposed approach over state-of-the-art methods under different experimental settings.
    GreaseLM: Graph REASoning Enhanced Language Models for Question Answering. (arXiv:2201.08860v1 [cs.CL])
    (0 min) Answering complex questions about textual narratives requires reasoning over both stated context and the world knowledge that underlies it. However, pretrained language models (LM), the foundation of most modern QA systems, do not robustly represent latent relationships between concepts, which is necessary for reasoning. While knowledge graphs (KG) are often used to augment LMs with structured representations of world knowledge, it remains an open question how to effectively fuse and reason over the KG representations and the language context, which provides situational constraints and nuances. In this work, we propose GreaseLM, a new model that fuses encoded representations from pretrained LMs and graph neural networks over multiple layers of modality interaction operations. Information from both modalities propagates to the other, allowing language context representations to be grounded by structured world knowledge, and allowing linguistic nuances (e.g., negation, hedging) in the context to inform the graph representations of knowledge. Our results on three benchmarks in the commonsense reasoning (i.e., CommonsenseQA, OpenbookQA) and medical question answering (i.e., MedQA-USMLE) domains demonstrate that GreaseLM can more reliably answer questions that require reasoning over both situational constraints and structured knowledge, even outperforming models 8x larger.
    Multi-Agent Adversarial Attacks for Multi-Channel Communications. (arXiv:2201.09149v1 [cs.MA])
    (0 min) Recently Reinforcement Learning (RL) has been applied as an anti-adversarial remedy in wireless communication networks. However, studying the RL-based approaches from the adversary's perspective has received little attention. Additionally, RL-based approaches in an anti-adversary or adversarial paradigm mostly consider single-channel communication (either channel selection or single channel power control), while multi-channel communication is more common in practice. In this paper, we propose a multi-agent adversary system (MAAS) for modeling and analyzing adversaries in a wireless communication scenario by careful design of the reward function under realistic communication scenarios. In particular, by modeling the adversaries as learning agents, we show that the proposed MAAS is able to successfully choose the transmitted channel(s) and their respective allocated power(s) without any prior knowledge of the sender strategy. Compared to the single-agent adversary (SAA), multi-agents in MAAS can achieve significant reduction in signal-to-noise ratio (SINR) under the same power constraints and partial observability, while providing improved stability and a more efficient learning process. Moreover, through empirical studies we show that the results in simulation are close to the ones in communication in reality, a conclusion that is pivotal to the validity of performance of agents evaluated in simulations.
    Overcoming Oversmoothness in Graph Convolutional Networks via Hybrid Scattering Networks. (arXiv:2201.08932v1 [stat.ML])
    (0 min) Geometric deep learning (GDL) has made great strides towards generalizing the design of structure-aware neural network architectures from traditional domains to non-Euclidean ones, such as graphs. This gave rise to graph neural network (GNN) models that can be applied to graph-structured datasets arising, for example, in social networks, biochemistry, and material science. Graph convolutional networks (GCNs) in particular, inspired by their Euclidean counterparts, have been successful in processing graph data by extracting structure-aware features. However, current GNN models (and GCNs in particular) are known to be constrained by various phenomena that limit their expressive power and ability to generalize to more complex graph datasets. Most models essentially rely on low-pass filtering of graph signals via local averaging operations, thus leading to oversmoothing. Here, we propose a hybrid GNN framework that combines traditional GCN filters with band-pass filters defined via the geometric scattering transform. We further introduce an attention framework that allows the model to locally attend over the combined information from different GNN filters at the node level. Our theoretical results establish the complementary benefits of the scattering filters to leverage structural information from the graph, while our experiments show the benefits of our method on various learning tasks.
  • stat.ML updates on arXiv.org

    Sinkformers: Transformers with Doubly Stochastic Attention. (arXiv:2110.11773v2 [cs.LG] UPDATED)
    (2 min) Attention based models such as Transformers involve pairwise interactions between data points, modeled with a learnable attention matrix. Importantly, this attention matrix is normalized with the SoftMax operator, which makes it row-wise stochastic. In this paper, we propose instead to use Sinkhorn's algorithm to make attention matrices doubly stochastic. We call the resulting model a Sinkformer. We show that the row-wise stochastic attention matrices in classical Transformers get close to doubly stochastic matrices as the number of epochs increases, justifying the use of Sinkhorn normalization as an informative prior. On the theoretical side, we show that, unlike the SoftMax operation, this normalization makes it possible to understand the iterations of self-attention modules as a discretized gradient-flow for the Wasserstein metric. We also show in the infinite number of samples limit that, when rescaling both attention matrices and depth, Sinkformers operate a heat diffusion. On the experimental side, we show that Sinkformers enhance model accuracy in vision and natural language processing tasks. In particular, on 3D shapes classification, Sinkformers lead to a significant improvement.
    Understanding Instance-based Interpretability of Variational Auto-Encoders. (arXiv:2105.14203v4 [cs.LG] UPDATED)
    (2 min) Instance-based interpretation methods have been widely studied for supervised learning methods as they help explain how black box neural networks predict. However, instance-based interpretations remain ill-understood in the context of unsupervised learning. In this paper, we investigate influence functions [Koh and Liang, 2017], a popular instance-based interpretation method, for a class of deep generative models called variational auto-encoders (VAE). We formally frame the counter-factual question answered by influence functions in this setting, and through theoretical analysis, examine what they reveal about the impact of training samples on classical unsupervised learning methods. We then introduce VAE- TracIn, a computationally efficient and theoretically sound solution based on Pruthi et al. [2020], for VAEs. Finally, we evaluate VAE-TracIn on several real world datasets with extensive quantitative and qualitative analysis.
    On Effective Scheduling of Model-based Reinforcement Learning. (arXiv:2111.08550v2 [cs.LG] UPDATED)
    (2 min) Model-based reinforcement learning has attracted wide attention due to its superior sample efficiency. Despite its impressive success so far, it is still unclear how to appropriately schedule the important hyperparameters to achieve adequate performance, such as the real data ratio for policy optimization in Dyna-style model-based algorithms. In this paper, we first theoretically analyze the role of real data in policy training, which suggests that gradually increasing the ratio of real data yields better performance. Inspired by the analysis, we propose a framework named AutoMBPO to automatically schedule the real data ratio as well as other hyperparameters in training model-based policy optimization (MBPO) algorithm, a representative running case of model-based methods. On several continuous control tasks, the MBPO instance trained with hyperparameters scheduled by AutoMBPO can significantly surpass the original one, and the real data ratio schedule found by AutoMBPO shows consistency with our theoretical analysis.
    Polyak-Ruppert Averaged Q-Leaning is Statistically Efficient. (arXiv:2112.14582v2 [stat.ML] UPDATED)
    (2 min) We study synchronous Q-learning with Polyak-Ruppert averaging (a.k.a., averaged Q-learning) in a $\gamma$-discounted MDP. We establish a functional central limit theorem (FCLT) for the averaged iteration $\bar{\boldsymbol{Q}}_T$ and show its standardized partial-sum process weakly converges to a rescaled Brownian motion. Furthermore, we show that $\bar{\boldsymbol{Q}}_T$ is actually a regular asymptotically linear (RAL) estimator for the optimal Q-value function $\boldsymbol{Q}^*$ with the most efficient influence function. This implies the averaged Q-learning iteration has the smallest asymptotic variance among all RAL estimators. In addition, we present a non-asymptotic analysis for the $\ell_{\infty}$ error $\mathbb{E}\|\bar{\boldsymbol{Q}}_T-\boldsymbol{Q}^*\|_{\infty}$, showing for polynomial step sizes it matches the instance-dependent lower bound as well as the optimal minimax complexity lower bound. In short, our theoretical analysis shows averaged Q-learning is statistically efficient.
    Learning Deep Graph Representations via Convolutional Neural Networks. (arXiv:2004.02131v2 [cs.LG] UPDATED)
    (2 min) Graph-structured data arise in many scenarios. A fundamental problem is to quantify the similarities of graphs for tasks such as classification. R-convolution graph kernels are positive-semidefinite functions that decompose graphs into substructures and compare them. One problem in the effective implementation of this idea is that the substructures are not independent, which leads to high-dimensional feature space. In addition, graph kernels cannot capture the high-order complex interactions between vertices. To mitigate these two problems, we propose a framework called DeepMap to learn deep representations for graph feature maps. The learned deep representation for a graph is a dense and low-dimensional vector that captures complex high-order interactions in a vertex neighborhood. DeepMap extends Convolutional Neural Networks (CNNs) to arbitrary graphs by generating aligned vertex sequences and building the receptive field for each vertex. We empirically validate DeepMap on various graph classification benchmarks and demonstrate that it achieves state-of-the-art performance.
    Learning PSD-valued functions using kernel sums-of-squares. (arXiv:2111.11306v2 [stat.ML] UPDATED)
    (2 min) Shape constraints such as positive semi-definiteness (PSD) for matrices or convexity for functions play a central role in many applications in machine learning and sciences, including metric learning, optimal transport, and economics. Yet, very few function models exist that enforce PSD-ness or convexity with good empirical performance and theoretical guarantees. In this paper, we introduce a kernel sum-of-squares model for functions that take values in the PSD cone, which extends kernel sums-of-squares models that were recently proposed to encode non-negative scalar functions. We provide a representer theorem for this class of PSD functions, show that it constitutes a universal approximator of PSD functions, and derive eigenvalue bounds in the case of subsampled equality constraints. We then apply our results to modeling convex functions, by enforcing a kernel sum-of-squares representation of their Hessian, and show that any smooth and strongly convex function may be thus represented. Finally, we illustrate our methods on a PSD matrix-valued regression task, and on scalar-valued convex regression.
    KalmanNet: Neural Network Aided Kalman Filtering for Partially Known Dynamics. (arXiv:2107.10043v2 [eess.SP] UPDATED)
    (2 min) State estimation of dynamical systems in real-time is a fundamental task in signal processing. For systems that are well-represented by a fully known linear Gaussian state space (SS) model, the celebrated Kalman filter (KF) is a low complexity optimal solution. However, both linearity of the underlying SS model and accurate knowledge of it are often not encountered in practice. Here, we present KalmanNet, a real-time state estimator that learns from data to carry out Kalman filtering under non-linear dynamics with partial information. By incorporating the structural SS model with a dedicated recurrent neural network module in the flow of the KF, we retain data efficiency and interpretability of the classic algorithm while implicitly learning complex dynamics from data. We numerically demonstrate that KalmanNet overcomes non-linearities and model mismatch, outperforming classic filtering methods operating with both mismatched and accurate domain knowledge.
    Learning Lipschitz Feedback Policies from Expert Demonstrations: Closed-Loop Guarantees, Generalization and Robustness. (arXiv:2103.16629v2 [cs.LG] UPDATED)
    (2 min) In this work, we propose a framework to learn feedback control policies with guarantees on closed-loop generalization and adversarial robustness. These policies are learned directly from expert demonstrations, contained in a dataset of state-control input pairs, without any prior knowledge of the task and system model. We use a Lipschitz-constrained loss minimization scheme to learn feedback policies with certified closed-loop robustness, wherein the Lipschitz constraint serves as a mechanism to tune the generalization performance and robustness to adversarial disturbances. Our analysis exploits the Lipschitz property to obtain closed-loop guarantees on generalization and robustness of the learned policies. In particular, we derive a finite sample bound on the policy learning error and establish robust closed-loop stability under the learned control policy. We also derive bounds on the closed-loop regret with respect to the expert policy and the deterioration of closed-loop performance under bounded (adversarial) disturbances to the state measurements. Numerical results validate our analysis and demonstrate the effectiveness of our robust feedback policy learning framework. Finally, our results suggest the existence of a potential tradeoff between nominal closed-loop performance and adversarial robustness, and that improvements in nominal closed-loop performance can only be made at the expense of robustness to adversarial perturbations.
    Fast and Efficient MMD-based Fair PCA via Optimization over Stiefel Manifold. (arXiv:2109.11196v3 [stat.ML] UPDATED)
    (2 min) This paper defines fair principal component analysis (PCA) as minimizing the maximum mean discrepancy (MMD) between dimensionality-reduced conditional distributions of different protected classes. The incorporation of MMD naturally leads to an exact and tractable mathematical formulation of fairness with good statistical properties. We formulate the problem of fair PCA subject to MMD constraints as a non-convex optimization over the Stiefel manifold and solve it using the Riemannian Exact Penalty Method with Smoothing (REPMS; Liu and Boumal, 2019). Importantly, we provide local optimality guarantees and explicitly show the theoretical effect of each hyperparameter in practical settings, extending previous results. Experimental comparisons based on synthetic and UCI datasets show that our approach outperforms prior work in explained variance, fairness, and runtime.
    Deep Localization of Mixed Image Tampering Techniques. (arXiv:1904.08484v3 [cs.CV] UPDATED)
    (2 min) With technological advances leading to an increase in mechanisms for image tampering, fraud detection methods must continue to be upgraded to match their sophistication. One problem with current methods is that they require prior knowledge of the method of forgery in order to determine which features to extract from the image to localize the region of interest. When a machine learning algorithm is used to learn different types of tampering from a large set of various image types, with a large enough database we can easily classify which images are tampered. However, we still are left with the question of which features to train on, and how to localize the manipulation. In this work, deep learning for object detection is adapted to tampering detection to solve these two problems, while fusing features from multiple classic techniques for improved accuracy. A Multi-stream version of the Faster RCNN network will be employed with the second stream having an input of the element-wise sum of the ELA and BAG error maps to provide even higher accuracy than a single stream alone.
    Tensor Completion by Multi-Rank via Unitary Transformation. (arXiv:2012.08784v2 [stat.ML] UPDATED)
    (0 min) One of the key problems in tensor completion is the number of uniformly random sample entries required for recovery guarantee. The main aim of this paper is to study $n_1 \times n_2 \times n_3$ third-order tensor completion based on transformed tensor singular value decomposition, and provide a bound on the number of required sample entries. Our approach is to make use of the multi-rank of the underlying tensor instead of its tubal rank in the bound. In numerical experiments on synthetic and imaging data sets, we demonstrate the effectiveness of our proposed bound for the number of sample entries. Moreover, our theoretical results are valid to any unitary transformation applied to $n_3$-dimension under transformed tensor singular value decomposition.
    A Lyapunov-Based Methodology for Constrained Optimization with Bandit Feedback. (arXiv:2106.05165v3 [cs.LG] UPDATED)
    (0 min) In a wide variety of applications including online advertising, contractual hiring, and wireless scheduling, the controller is constrained by a stringent budget constraint on the available resources, which are consumed in a random amount by each action, and a stochastic feasibility constraint that may impose important operational limitations on decision-making. In this work, we consider a general model to address such problems, where each action returns a random reward, cost, and penalty from an unknown joint distribution, and the decision-maker aims to maximize the total reward under a budget constraint $B$ on the total cost and a stochastic constraint on the time-average penalty. We propose a novel low-complexity algorithm based on Lyapunov optimization methodology, named ${\tt LyOn}$, and prove that for $K$ arms it achieves $O(\sqrt{K B\log B})$ regret and zero constraint-violation when $B$ is sufficiently large. The low computational cost and sharp performance bounds of ${\tt LyOn}$ suggest that Lyapunov-based algorithm design methodology can be effective in solving constrained bandit optimization problems.
    Stochastic normalizing flows as non-equilibrium transformations. (arXiv:2201.08862v1 [hep-lat])
    (0 min) Normalizing flows are a class of deep generative models that provide a promising route to sample lattice field theories more efficiently than conventional Monte~Carlo simulations. In this work we show that the theoretical framework of stochastic normalizing flows, in which neural-network layers are combined with Monte~Carlo updates, is the same that underlies out-of-equilibrium simulations based on Jarzynski's equality, which have been recently deployed to compute free-energy differences in lattice gauge theories. We lay out a strategy to optimize the efficiency of this extended class of generative models and present examples of applications.
    Discovering Parametric Activation Functions. (arXiv:2006.03179v5 [cs.LG] UPDATED)
    (0 min) Recent studies have shown that the choice of activation function can significantly affect the performance of deep learning networks. However, the benefits of novel activation functions have been inconsistent and task dependent, and therefore the rectified linear unit (ReLU) is still the most commonly used. This paper proposes a technique for customizing activation functions automatically, resulting in reliable improvements in performance. Evolutionary search is used to discover the general form of the function, and gradient descent to optimize its parameters for different parts of the network and over the learning process. Experiments with four different neural network architectures on the CIFAR-10 and CIFAR-100 image classification datasets show that this approach is effective. It discovers both general activation functions and specialized functions for different architectures, consistently improving accuracy over ReLU and other activation functions by significant margins. The approach can therefore be used as an automated optimization step in applying deep learning to new tasks.
    Approximation bounds for norm constrained neural networks with applications to regression and GANs. (arXiv:2201.09418v1 [cs.LG])
    (0 min) This paper studies the approximation capacity of ReLU neural networks with norm constraint on the weights. We prove upper and lower bounds on the approximation error of these networks for smooth function classes. The lower bound is derived through the Rademacher complexity of neural networks, which may be of independent interest. We apply these approximation bounds to analyze the convergence of regression using norm constrained neural networks and distribution estimation by GANs. In particular, we obtain convergence rates for over-parameterized neural networks. It is also shown that GANs can achieve optimal rate of learning probability distributions, when the discriminator is a properly chosen norm constrained neural network.
    Spectral, Probabilistic, and Deep Metric Learning: Tutorial and Survey. (arXiv:2201.09267v1 [stat.ML])
    (0 min) This is a tutorial and survey paper on metric learning. Algorithms are divided into spectral, probabilistic, and deep metric learning. We first start with the definition of distance metric, Mahalanobis distance, and generalized Mahalanobis distance. In spectral methods, we start with methods using scatters of data, including the first spectral metric learning, relevant methods to Fisher discriminant analysis, Relevant Component Analysis (RCA), Discriminant Component Analysis (DCA), and the Fisher-HSIC method. Then, large-margin metric learning, imbalanced metric learning, locally linear metric adaptation, and adversarial metric learning are covered. We also explain several kernel spectral methods for metric learning in the feature space. We also introduce geometric metric learning methods on the Riemannian manifolds. In probabilistic methods, we start with collapsing classes in both input and feature spaces and then explain the neighborhood component analysis methods, Bayesian metric learning, information theoretic methods, and empirical risk minimization in metric learning. In deep learning methods, we first introduce reconstruction autoencoders and supervised loss functions for metric learning. Then, Siamese networks and its various loss functions, triplet mining, and triplet sampling are explained. Deep discriminant analysis methods, based on Fisher discriminant analysis, are also reviewed. Finally, we introduce multi-modal deep metric learning, geometric metric learning by neural networks, and few-shot metric learning.
    SurvITE: Learning Heterogeneous Treatment Effects from Time-to-Event Data. (arXiv:2110.14001v2 [cs.LG] UPDATED)
    (0 min) We study the problem of inferring heterogeneous treatment effects from time-to-event data. While both the related problems of (i) estimating treatment effects for binary or continuous outcomes and (ii) predicting survival outcomes have been well studied in the recent machine learning literature, their combination -- albeit of high practical relevance -- has received considerably less attention. With the ultimate goal of reliably estimating the effects of treatments on instantaneous risk and survival probabilities, we focus on the problem of learning (discrete-time) treatment-specific conditional hazard functions. We find that unique challenges arise in this context due to a variety of covariate shift issues that go beyond a mere combination of well-studied confounding and censoring biases. We theoretically analyse their effects by adapting recent generalization bounds from domain adaptation and treatment effect estimation to our setting and discuss implications for model design. We use the resulting insights to propose a novel deep learning method for treatment-specific hazard estimation based on balancing representations. We investigate performance across a range of experimental settings and empirically confirm that our method outperforms baselines by addressing covariate shifts from various sources.
    Bayesian Graph Contrastive Learning. (arXiv:2112.07823v2 [cs.LG] UPDATED)
    (0 min) Contrastive learning has become a key component of self-supervised learning approaches for graph-structured data. However, despite their success, existing graph contrastive learning methods are incapable of uncertainty quantification for node representations or their downstream tasks, limiting their application in high-stakes domains. In this paper, we propose a novel Bayesian perspective of graph contrastive learning methods showing random augmentations leads to stochastic encoders. As a result, our proposed method represents each node by a distribution in the latent space in contrast to existing techniques which embed each node to a deterministic vector. By learning distributional representations, we provide uncertainty estimates in downstream graph analytics tasks and increase the expressive power of the predictive model. In addition, we propose a Bayesian framework to infer the probability of perturbations in each view of the contrastive model, eliminating the need for a computationally expensive search for hyperparameter tuning. We empirically show a considerable improvement in performance compared to existing state-of-the-art methods on several benchmark datasets.
    Enhanced spatio-temporal electric load forecasts using less data with active deep learning. (arXiv:2012.04407v2 [cs.LG] UPDATED)
    (0 min) An effective way to oppose global warming and mitigate climate change is to electrify our energy sectors and supply their electric power from renewable wind and solar. Spatio-temporal predictions of electric load become increasingly important for planning this transition, while deep learning prediction models provide increasingly accurate predictions for it. The data used for training deep learning models, however, is usually collected at random using a passive learning approach. This naturally results in a large demand for data and associated costs for sensors like smart meters, posing a large barrier for electric utilities in decarbonizing their grids. Here, we test active learning where we leverage additional computation for collecting a more informative subset of data. We show how electric utilities can apply active learning to better distribute smart meters and collect their data for more accurate predictions of load with about half the data compared to when applying passive learning.
    On Distributed Learning with Constant Communication Bits. (arXiv:2109.06388v2 [cs.IT] UPDATED)
    (0 min) In this paper, we study a distributed learning problem constrained by constant communication bits. Specifically, we consider the distributed hypothesis testing (DHT) problem where two distributed nodes are constrained to transmit a constant number of bits to a central decoder. In such cases, we show that in order to achieve the optimal error exponents, it suffices to consider the empirical distributions of observed data sequences and encode them to the transmission bits. With such a coding strategy, we develop a geometric approach in the distribution spaces and establish an inner bound of error exponent regions. In particular, we show the optimal achievable error exponents and coding schemes for the following cases: (i) both nodes can transmit $\log_23$ bits; (ii) one of the nodes can transmit $1$ bit, and the other node is not constrained; (iii) the joint distribution of the nodes are conditionally independent given one hypothesis. Furthermore, we provide several numerical examples for illustrating the theoretical results. Our results provide theoretical guidance for designing practical distributed learning rules, and the developed approach also reveals new potentials for establishing error exponents for DHT with more general communication constraints.
    Active Learning Polynomial Threshold Functions. (arXiv:2201.09433v1 [cs.LG])
    (0 min) We initiate the study of active learning polynomial threshold functions (PTFs). While traditional lower bounds imply that even univariate quadratics cannot be non-trivially actively learned, we show that allowing the learner basic access to the derivatives of the underlying classifier circumvents this issue and leads to a computationally efficient algorithm for active learning degree-$d$ univariate PTFs in $\tilde{O}(d^3\log(1/\varepsilon\delta))$ queries. We also provide near-optimal algorithms and analyses for active learning PTFs in several average case settings. Finally, we prove that access to derivatives is insufficient for active learning multivariate PTFs, even those of just two variables.
    Optimal Dynamic Regret in Proper Online Learning with Strongly Convex Losses and Beyond. (arXiv:2201.08905v1 [cs.LG])
    (2 min) We study the framework of universal dynamic regret minimization with strongly convex losses. We answer an open problem in Baby and Wang 2021 by showing that in a proper learning setup, Strongly Adaptive algorithms can achieve the near optimal dynamic regret of $\tilde O(d^{1/3} n^{1/3}\text{TV}[u_{1:n}]^{2/3} \vee d)$ against any comparator sequence $u_1,\ldots,u_n$ simultaneously, where $n$ is the time horizon and $\text{TV}[u_{1:n}]$ is the Total Variation of comparator. These results are facilitated by exploiting a number of new structures imposed by the KKT conditions that were not considered in Baby and Wang 2021 which also lead to other improvements over their results such as: (a) handling non-smooth losses and (b) improving the dimension dependence on regret. Further, we also derive near optimal dynamic regret rates for the special case of proper online learning with exp-concave losses and an $L_\infty$ constrained decision set.
    Recurrent Neural Networks with Mixed Hierarchical Structures and EM Algorithm for Natural Language Processing. (arXiv:2201.08919v1 [cs.CL])
    (2 min) How to obtain hierarchical representations with an increasing level of abstraction becomes one of the key issues of learning with deep neural networks. A variety of RNN models have recently been proposed to incorporate both explicit and implicit hierarchical information in modeling languages in the literature. In this paper, we propose a novel approach called the latent indicator layer to identify and learn implicit hierarchical information (e.g., phrases), and further develop an EM algorithm to handle the latent indicator layer in training. The latent indicator layer further simplifies a text's hierarchical structure, which allows us to seamlessly integrate different levels of attention mechanisms into the structure. We called the resulting architecture as the EM-HRNN model. Furthermore, we develop two bootstrap strategies to effectively and efficiently train the EM-HRNN model on long text documents. Simulation studies and real data applications demonstrate that the EM-HRNN model with bootstrap training outperforms other RNN-based models in document classification tasks. The performance of the EM-HRNN model is comparable to a Transformer-based method called Bert-base, though the former is much smaller model and does not require pre-training.
    Weight Expansion: A New Perspective on Dropout and Generalization. (arXiv:2201.09209v1 [cs.LG])
    (2 min) While dropout is known to be a successful regularization technique, insights into the mechanisms that lead to this success are still lacking. We introduce the concept of \emph{weight expansion}, an increase in the signed volume of a parallelotope spanned by the column or row vectors of the weight covariance matrix, and show that weight expansion is an effective means of increasing the generalization in a PAC-Bayesian setting. We provide a theoretical argument that dropout leads to weight expansion and extensive empirical support for the correlation between dropout and weight expansion. To support our hypothesis that weight expansion can be regarded as an \emph{indicator} of the enhanced generalization capability endowed by dropout, and not just as a mere by-product, we have studied other methods that achieve weight expansion (resp.\ contraction), and found that they generally lead to an increased (resp.\ decreased) generalization ability. This suggests that dropout is an attractive regularizer, because it is a computationally cheap method for obtaining weight expansion. This insight justifies the role of dropout as a regularizer, while paving the way for identifying regularizers that promise improved generalization through weight expansion.
    Multi-way Spectral Clustering of Augmented Multi-view Data through Deep Collective Matrix Tri-factorization. (arXiv:2009.05805v2 [cs.LG] UPDATED)
    (2 min) We present the first deep learning based architecture for collective matrix tri-factorization (DCMTF) of arbitrary collections of matrices, also known as augmented multi-view data. DCMTF can be used for multi-way spectral clustering of heterogeneous collections of relational data matrices to discover latent clusters in each input matrix, across both dimensions, as well as the strengths of association across clusters. The source code for DCMTF is available on our public repository: https://bitbucket.org/cdal/dcmtf_generic
    Spherical Poisson Point Process Intensity Function Modeling and Estimation with Measure Transport. (arXiv:2201.09485v1 [stat.ME])
    (2 min) Recent years have seen an increased interest in the application of methods and techniques commonly associated with machine learning and artificial intelligence to spatial statistics. Here, in a celebration of the ten-year anniversary of the journal Spatial Statistics, we bring together normalizing flows, commonly used for density function estimation in machine learning, and spherical point processes, a topic of particular interest to the journal's readership, to present a new approach for modeling non-homogeneous Poisson process intensity functions on the sphere. The central idea of this framework is to build, and estimate, a flexible bijective map that transforms the underlying intensity function of interest on the sphere into a simpler, reference, intensity function, also on the sphere. Map estimation can be done efficiently using automatic differentiation and stochastic gradient descent, and uncertainty quantification can be done straightforwardly via nonparametric bootstrap. We investigate the viability of the proposed method in a simulation study, and illustrate its use in a proof-of-concept study where we model the intensity of cyclone events in the North Pacific Ocean. Our experiments reveal that normalizing flows present a flexible and straightforward way to model intensity functions on spheres, but that their potential to yield a good fit depends on the architecture of the bijective map, which can be difficult to establish in practice.
    A Statistical Recurrent Stochastic Volatility Model for Stock Markets. (arXiv:1906.02884v3 [econ.EM] UPDATED)
    (2 min) The Stochastic Volatility (SV) model and its variants are widely used in the financial sector while recurrent neural network (RNN) models are successfully used in many large-scale industrial applications of Deep Learning. Our article combines these two methods in a non-trivial way and proposes a model, which we call the Statistical Recurrent Stochastic Volatility (SR-SV) model, to capture the dynamics of stochastic volatility. The proposed model is able to capture complex volatility effects (e.g., non-linearity and long-memory auto-dependence) overlooked by the conventional SV models, is statistically interpretable and has an impressive out-of-sample forecast performance. These properties are carefully discussed and illustrated through extensive simulation studies and applications to five international stock index datasets: The German stock index DAX30, the Hong Kong stock index HSI50, the France market index CAC40, the US stock market index SP500 and the Canada market index TSX250. An user-friendly software package together with the examples reported in the paper are available at \url{https://github.com/vbayeslab}.
    Recurrent Conditional Heteroskedasticity. (arXiv:2010.13061v2 [econ.EM] UPDATED)
    (2 min) We propose a new class of financial volatility models, called the REcurrent Conditional Heteroskedastic (RECH) models, to improve both in-sample analysis and out-ofsample forecasting of the traditional conditional heteroskedastic models. In particular, we incorporate auxiliary deterministic processes, governed by recurrent neural networks, into the conditional variance of the traditional conditional heteroskedastic models, e.g. GARCH-type models, to flexibly capture the dynamics of the underlying volatility. RECH models can detect interesting effects in financial volatility overlooked by the existing conditional heteroskedastic models such as the GARCH, GJR and EGARCH. The new models often have good out-of-sample forecasts while still explaining well the stylized facts of financial volatility by retaining the well-established features of econometric GARCH-type models. These properties are illustrated through simulation studies and applications to thirty-one stock indices and exchange rate data. . An user-friendly software package together with the examples reported in the paper are available at https://github.com/vbayeslab.
    A Novel Framework for Online Supervised Learning with Feature Selection. (arXiv:1803.11521v9 [stat.ML] UPDATED)
    (2 min) Current online learning methods suffer issues such as lower convergence rates and limited capability to recover the support of the true features compared to their offline counterparts. In this paper, we present a novel framework for online learning based on running averages and introduce a series of online versions of popular offline methods such as Elastic Net, Minimax Concave Penalty, and Feature Selection with Annealing. The framework can handle an arbitrarily large number of observations with the restriction that the data dimension is not too large, e.g. p<50,000. We prove the equivalence between our online methods and their offline counterparts and give theoretical true feature recovery and convergence guarantees for some of them. In contrast to existing online methods, the proposed methods can extract models with any desired sparsity level at any time. Numerical experiments indicate that our new methods enjoy high true feature recovery accuracy and a fast convergence rate, compared with standard online and offline algorithms. We also show how the running averages framework can be used for model adaptation in the presence of model drift. Finally, we present applications to large datasets where again the proposed framework shows competitive results compared to popular online and offline algorithms.
    A Machine Learning Framework for Distributed Functional Compression over Wireless Channels in IoT. (arXiv:2201.09483v1 [cs.LG])
    (2 min) IoT devices generating enormous data and state-of-the-art machine learning techniques together will revolutionize cyber-physical systems. In many diverse fields, from autonomous driving to augmented reality, distributed IoT devices compute specific target functions without simple forms like obstacle detection, object recognition, etc. Traditional cloud-based methods that focus on transferring data to a central location either for training or inference place enormous strain on network resources. To address this, we develop, to the best of our knowledge, the first machine learning framework for distributed functional compression over both the Gaussian Multiple Access Channel (GMAC) and orthogonal AWGN channels. Due to the Kolmogorov-Arnold representation theorem, our machine learning framework can, by design, compute any arbitrary function for the desired functional compression task in IoT. Importantly the raw sensory data are never transferred to a central node for training or inference, thus reducing communication. For these algorithms, we provide theoretical convergence guarantees and upper bounds on communication. Our simulations show that the learned encoders and decoders for functional compression perform significantly better than traditional approaches, are robust to channel condition changes and sensor outages. Compared to the cloud-based scenario, our algorithms reduce channel use by two orders of magnitude.
    Speaker Diarization with LSTM. (arXiv:1710.10468v7 [eess.AS] UPDATED)
    (2 min) For many years, i-vector based audio embedding techniques were the dominant approach for speaker verification and speaker diarization applications. However, mirroring the rise of deep learning in various domains, neural network based audio embeddings, also known as d-vectors, have consistently demonstrated superior speaker verification performance. In this paper, we build on the success of d-vector based speaker verification systems to develop a new d-vector based approach to speaker diarization. Specifically, we combine LSTM-based d-vector audio embeddings with recent work in non-parametric clustering to obtain a state-of-the-art speaker diarization system. Our system is evaluated on three standard public datasets, suggesting that d-vector based diarization systems offer significant advantages over traditional i-vector based systems. We achieved a 12.0% diarization error rate on NIST SRE 2000 CALLHOME, while our model is trained with out-of-domain data from voice search logs.
    PAC Mode Estimation using PPR Martingale Confidence Sequences. (arXiv:2109.05047v2 [stat.ME] UPDATED)
    (2 min) We consider the problem of correctly identifying the mode of a discrete distribution $\mathcal{P}$ with sufficiently high probability by observing a sequence of i.i.d. samples drawn according to $\mathcal{P}$. This problem reduces to the estimation of a single parameter when $\mathcal{P}$ has a support set of size $K = 2$. Noting the efficiency of prior-posterior-ratio (PPR) martingale confidence sequences for handling this special case, we propose a generalisation to mode estimation, in which $\mathcal{P}$ may take $K \geq 2$ values. We observe that the "one-versus-one" principle yields a more efficient generalisation than the "one-versus-rest" alternative. Our resulting stopping rule, denoted PPR-ME, is optimal in its sample complexity up to a logarithmic factor. Moreover, PPR-ME empirically outperforms several other competing approaches for mode estimation. We demonstrate the gains offered by PPR-ME in two practical applications: (1) sample-based forecasting of the winner in indirect election systems, and (2) efficient verification of smart contracts in permissionless blockchains.
    Probability Distribution on Full Rooted Trees. (arXiv:2109.12825v4 [stat.ML] UPDATED)
    (2 min) The recursive and hierarchical structure of full rooted trees is applicable to represent statistical models in various areas, such as data compression, image processing, and machine learning. In most of these cases, the full rooted tree is not a random variable; as such, model selection to avoid overfitting becomes problematic. A method to solve this problem is to assume a prior distribution on the full rooted trees. This enables the optimal model selection based on the Bayes decision theory. For example, by assigning a low prior probability to a complex model, the maximum a posteriori estimator prevents the selection of the complex one. Furthermore, we can average all the models weighted by their posteriors. In this paper, we propose a probability distribution on a set of full rooted trees. Its parametric representation is suitable for calculating the properties of our distribution using recursive functions, such as the mode, expectation, and posterior distribution. Although such distributions have been proposed in previous studies, they are only applicable to specific applications. Therefore, we extract their mathematically essential components and derive new generalized methods to calculate the expectation, posterior distribution, etc.
    Optimal SQ Lower Bounds for Learning Halfspaces with Massart Noise. (arXiv:2201.09818v1 [cs.LG])
    (2 min) We give tight statistical query (SQ) lower bounds for learnining halfspaces in the presence of Massart noise. In particular, suppose that all labels are corrupted with probability at most $\eta$. We show that for arbitrary $\eta \in [0,1/2]$ every SQ algorithm achieving misclassification error better than $\eta$ requires queries of superpolynomial accuracy or at least a superpolynomial number of queries. Further, this continues to hold even if the information-theoretically optimal error $\mathrm{OPT}$ is as small as $\exp\left(-\log^c(d)\right)$, where $d$ is the dimension and $0 < c < 1$ is an arbitrary absolute constant, and an overwhelming fraction of examples are noiseless. Our lower bound matches known polynomial time algorithms, which are also implementable in the SQ framework. Previously, such lower bounds only ruled out algorithms achieving error $\mathrm{OPT} + \epsilon$ or error better than $\Omega(\eta)$ or, if $\eta$ is close to $1/2$, error $\eta - o_\eta(1)$, where the term $o_\eta(1)$ is constant in $d$ but going to 0 for $\eta$ approaching $1/2$. As a consequence, we also show that achieving misclassification error better than $1/2$ in the $(A,\alpha)$-Tsybakov model is SQ-hard for $A$ constant and $\alpha$ bounded away from 1.
    The Many Faces of Adversarial Risk. (arXiv:2201.08956v1 [stat.ML])
    (2 min) Adversarial risk quantifies the performance of classifiers on adversarially perturbed data. Numerous definitions of adversarial risk -- not all mathematically rigorous and differing subtly in the details -- have appeared in the literature. In this paper, we revisit these definitions, make them rigorous, and critically examine their similarities and differences. Our technical tools derive from optimal transport, robust statistics, functional analysis, and game theory. Our contributions include the following: generalizing Strassen's theorem to the unbalanced optimal transport setting with applications to adversarial classification with unequal priors; showing an equivalence between adversarial robustness and robust hypothesis testing with $\infty$-Wasserstein uncertainty sets; proving the existence of a pure Nash equilibrium in the two-player game between the adversary and the algorithm; and characterizing adversarial risk by the minimum Bayes error between a pair of distributions belonging to the $\infty$-Wasserstein uncertainty sets. Our results generalize and deepen recently discovered connections between optimal transport and adversarial robustness and reveal new connections to Choquet capacities and game theory.
    An Infinite-Feature Extension for Bayesian ReLU Nets That Fixes Their Asymptotic Overconfidence. (arXiv:2010.02709v5 [cs.LG] UPDATED)
    (2 min) A Bayesian treatment can mitigate overconfidence in ReLU nets around the training data. But far away from them, ReLU Bayesian neural networks (BNNs) can still underestimate uncertainty and thus be asymptotically overconfident. This issue arises since the output variance of a BNN with finitely many features is quadratic in the distance from the data region. Meanwhile, Bayesian linear models with ReLU features converge, in the infinite-width limit, to a particular Gaussian process (GP) with a variance that grows cubically so that no asymptotic overconfidence can occur. While this may seem of mostly theoretical interest, in this work, we show that it can be used in practice to the benefit of BNNs. We extend finite ReLU BNNs with infinite ReLU features via the GP and show that the resulting model is asymptotically maximally uncertain far away from the data while the BNNs' predictive power is unaffected near the data. Although the resulting model approximates a full GP posterior, thanks to its structure, it can be applied \emph{post-hoc} to any pre-trained ReLU BNN at a low cost.
    Effective Learning of a GMRF Mixture Model. (arXiv:2005.09030v3 [cs.LG] UPDATED)
    (2 min) Learning a Gaussian Mixture Model (GMM) is hard when the number of parameters is too large given the amount of available data. As a remedy, we propose restricting the GMM to a Gaussian Markov Random Field Mixture Model (GMRF-MM), as well as a new method for estimating the latter's sparse precision (i.e., inverse covariance) matrices. When the sparsity pattern of each matrix is known, we propose an efficient optimization method for the Maximum Likelihood Estimate (MLE) of that matrix. When it is unknown, we utilize the popular Graphical Least Absolute Shrinkage and Selection Operator (GLASSO) to estimate that pattern. However, we show that even for a single Gaussian, when GLASSO is tuned to successfully estimate the sparsity pattern, it does so at the price of a substantial bias of the values of the nonzero entries of the matrix, and we show that this problem only worsens in a mixture setting. To overcome this, we discard the nonzero values estimated by GLASSO, keep only its pattern estimate and use it within the proposed MLE method. This yields an effective two-step procedure that removes the bias. We show that our "debiasing" approach outperforms GLASSO in both the single-GMRF and the GMRF-MM cases. We also show that when learning priors for image patches, our method outperforms GLASSO even if we merely use an educated guess about the sparsity pattern, and that our GMRF-MM outperforms the baseline GMM on real and synthetic high-dimensional datasets.
    Optimal Estimation and Computational Limit of Low-rank Gaussian Mixtures. (arXiv:2201.09040v1 [math.ST])
    (2 min) Structural matrix-variate observations routinely arise in diverse fields such as multi-layer network analysis and brain image clustering. While data of this type have been extensively investigated with fruitful outcomes being delivered, the fundamental questions like its statistical optimality and computational limit are largely under-explored. In this paper, we propose a low-rank Gaussian mixture model (LrMM) assuming each matrix-valued observation has a planted low-rank structure. Minimax lower bounds for estimating the underlying low-rank matrix are established allowing a whole range of sample sizes and signal strength. Under a minimal condition on signal strength, referred to as the information-theoretical limit or statistical limit, we prove the minimax optimality of a maximum likelihood estimator which, in general, is computationally infeasible. If the signal is stronger than a certain threshold, called the computational limit, we design a computationally fast estimator based on spectral aggregation and demonstrate its minimax optimality. Moreover, when the signal strength is smaller than the computational limit, we provide evidences based on the low-degree likelihood ratio framework to claim that no polynomial-time algorithm can consistently recover the underlying low-rank matrix. Our results reveal multiple phase transitions in the minimax error rates and the statistical-to-computational gap. Numerical experiments confirm our theoretical findings. We further showcase the merit of our spectral aggregation method on the worldwide food trading dataset.
    Overcoming Oversmoothness in Graph Convolutional Networks via Hybrid Scattering Networks. (arXiv:2201.08932v1 [stat.ML])
    (2 min) Geometric deep learning (GDL) has made great strides towards generalizing the design of structure-aware neural network architectures from traditional domains to non-Euclidean ones, such as graphs. This gave rise to graph neural network (GNN) models that can be applied to graph-structured datasets arising, for example, in social networks, biochemistry, and material science. Graph convolutional networks (GCNs) in particular, inspired by their Euclidean counterparts, have been successful in processing graph data by extracting structure-aware features. However, current GNN models (and GCNs in particular) are known to be constrained by various phenomena that limit their expressive power and ability to generalize to more complex graph datasets. Most models essentially rely on low-pass filtering of graph signals via local averaging operations, thus leading to oversmoothing. Here, we propose a hybrid GNN framework that combines traditional GCN filters with band-pass filters defined via the geometric scattering transform. We further introduce an attention framework that allows the model to locally attend over the combined information from different GNN filters at the node level. Our theoretical results establish the complementary benefits of the scattering filters to leverage structural information from the graph, while our experiments show the benefits of our method on various learning tasks.
    Ensemble Inference Methods for Models With Noisy and Expensive Likelihoods. (arXiv:2104.03384v2 [math.NA] UPDATED)
    (2 min) The increasing availability of data presents an opportunity to calibrate unknown parameters which appear in complex models of phenomena in the biomedical, physical and social sciences. However, model complexity often leads to parameter-to-data maps which are expensive to evaluate and are only available through noisy approximations. This paper is concerned with the use of interacting particle systems for the solution of the resulting inverse problems for parameters. Of particular interest is the case where the available forward model evaluations are subject to rapid fluctuations, in parameter space, superimposed on the smoothly varying large scale parametric structure of interest. {A motivating example from climate science is presented, and ensemble Kalman methods (which do not use the derivative of the parameter-to-data map) are shown, empirically, to perform well. Multiscale analysis is then used to analyze the behaviour of interacting particle system algorithms when rapid fluctuations, which we refer to as noise, pollute the large scale parametric dependence of the parameter-to-data map. Ensemble Kalman methods and Langevin-based methods} (the latter use the derivative of the parameter-to-data map) are compared in this light. The ensemble Kalman methods are shown to behave favourably in the presence of noise in the parameter-to-data map, whereas Langevin methods are adversely affected. On the other hand, Langevin methods have the correct equilibrium distribution in the setting of noise-free forward models, whilst ensemble Kalman methods only provide an uncontrolled approximation, except in the linear case. Therefore a new class of algorithms, ensemble Gaussian process samplers, which combine the benefits of both ensemble Kalman and Langevin methods, are introduced and shown to perform favourably.
    Universal Online Learning with Unbounded Losses: Memory Is All You Need. (arXiv:2201.08903v1 [stat.ML])
    (2 min) We resolve an open problem of Hanneke on the subject of universally consistent online learning with non-i.i.d. processes and unbounded losses. The notion of an optimistically universal learning rule was defined by Hanneke in an effort to study learning theory under minimal assumptions. A given learning rule is said to be optimistically universal if it achieves a low long-run average loss whenever the data generating process makes this goal achievable by some learning rule. Hanneke posed as an open problem whether, for every unbounded loss, the family of processes admitting universal learning are precisely those having a finite number of distinct values almost surely. In this paper, we completely resolve this problem, showing that this is indeed the case. As a consequence, this also offers a dramatically simpler formulation of an optimistically universal learning rule for any unbounded loss: namely, the simple memorization rule already suffices. Our proof relies on constructing random measurable partitions of the instance space and could be of independent interest for solving other open questions. We extend the results to the non-realizable setting thereby providing an optimistically universal Bayes consistent learning rule.
    An Attention-based ConvLSTM Autoencoder with Dynamic Thresholding for Unsupervised Anomaly Detection in Multivariate Time Series. (arXiv:2201.09172v1 [cs.LG])
    (2 min) As a substantial amount of multivariate time series data is being produced by the complex systems in Smart Manufacturing, improved anomaly detection frameworks are needed to reduce the operational risks and the monitoring burden placed on the system operators. However, building such frameworks is challenging, as a sufficiently large amount of defective training data is often not available and frameworks are required to capture both the temporal and contextual dependencies across different time steps while being robust to noise. In this paper, we propose an unsupervised Attention-based Convolutional Long Short-Term Memory (ConvLSTM) Autoencoder with Dynamic Thresholding (ACLAE-DT) framework for anomaly detection and diagnosis in multivariate time series. The framework starts by pre-processing and enriching the data, before constructing feature images to characterize the system statuses across different time steps by capturing the inter-correlations between pairs of time series. Afterwards, the constructed feature images are fed into an attention-based ConvLSTM autoencoder, which aims to encode the constructed feature images and capture the temporal behavior, followed by decoding the compressed knowledge representation to reconstruct the feature images input. The reconstruction errors are then computed and subjected to a statistical-based, dynamic thresholding mechanism to detect and diagnose the anomalies. Evaluation results conducted on real-life manufacturing data demonstrate the performance strengths of the proposed approach over state-of-the-art methods under different experimental settings.
    Probability Distribution on Rooted Trees. (arXiv:2201.09460v1 [cs.LG])
    (2 min) The hierarchical and recursive expressive capability of rooted trees is applicable to represent statistical models in various areas, such as data compression, image processing, and machine learning. On the other hand, such hierarchical expressive capability causes a problem in tree selection to avoid overfitting. One unified approach to solve this is a Bayesian approach, on which the rooted tree is regarded as a random variable and a direct loss function can be assumed on the selected model or the predicted value for a new data point. However, all the previous studies on this approach are based on the probability distribution on full trees, to the best of our knowledge. In this paper, we propose a generalized probability distribution for any rooted trees in which only the maximum number of child nodes and the maximum depth are fixed. Furthermore, we derive recursive methods to evaluate the characteristics of the probability distribution without any approximations.
    VCG Mechanism Design with Unknown Agent Values under Stochastic Bandit Feedback. (arXiv:2004.08924v4 [stat.ML] UPDATED)
    (2 min) We study a multi-round welfare-maximising mechanism design problem in instances where agents do not know their values. On each round, a mechanism first assigns an allocation each to a set of agents and charges them a price; at the end of the round, the agents provide (stochastic) feedback to the mechanism for the allocation they received. This setting is motivated by applications in cloud markets and online advertising where an agent may know her value for an allocation only after experiencing it. Therefore, the mechanism needs to explore different allocations for each agent so that it can learn their values, while simultaneously attempting to find the socially optimal set of allocations. Our focus is on truthful and individually rational mechanisms which imitate the classical VCG mechanism in the long run. To that end, we first define three notions of regret for the welfare, the individual utilities of each agent and that of the mechanism. We show that these three terms are interdependent via an $\Omega(T^{\frac{2}{3}})$ lower bound for the maximum of these three terms after $T$ rounds of allocations, and describe an algorithm which essentially achieves this rate. Our framework also provides flexibility to control the pricing scheme so as to trade-off between the agent and seller regrets. Next, we define asymptotic variants for the truthfulness and individual rationality requirements and provide asymptotic rates to quantify the degree to which both properties are satisfied by the proposed algorithm.
    Multiscale Generative Models: Improving Performance of a Generative Model Using Feedback from Other Dependent Generative Models. (arXiv:2201.09644v1 [cs.LG])
    (2 min) Realistic fine-grained multi-agent simulation of real-world complex systems is crucial for many downstream tasks such as reinforcement learning. Recent work has used generative models (GANs in particular) for providing high-fidelity simulation of real-world systems. However, such generative models are often monolithic and miss out on modeling the interaction in multi-agent systems. In this work, we take a first step towards building multiple interacting generative models (GANs) that reflects the interaction in real world. We build and analyze a hierarchical set-up where a higher-level GAN is conditioned on the output of multiple lower-level GANs. We present a technique of using feedback from the higher-level GAN to improve performance of lower-level GANs. We mathematically characterize the conditions under which our technique is impactful, including understanding the transfer learning nature of our set-up. We present three distinct experiments on synthetic data, time series data, and image domain, revealing the wide applicability of our technique.
    How to scale hyperparameters for quickshift image segmentation. (arXiv:2201.09286v1 [cs.CV])
    (2 min) Quickshift is a popular algorithm for image segmentation, used as a preprocessing step in many applications. Unfortunately, it is quite challenging to understand the hyperparameters' influence on the number and shape of superpixels produced by the method. In this paper, we study theoretically a slightly modified version of the quickshift algorithm, with a particular emphasis on homogeneous image patches with i.i.d. pixel noise and sharp boundaries between such patches. Leveraging this analysis, we derive a simple heuristic to scale quickshift hyperparameters when dealing with real images, which we check empirically.
    Machine Learning Symmetry. (arXiv:2201.09345v1 [hep-th])
    (2 min) We review recent work in machine learning aspects of conformal field theory and Lie algebra representation theory using neural networks.
    Robust Wavelet-based Assessment of Scaling with Applications. (arXiv:2201.09320v1 [stat.ME])
    (2 min) A number of approaches have dealt with statistical assessment of self-similarity, and many of those are based on multiscale concepts. Most rely on certain distributional assumptions which are usually violated by real data traces, often characterized by large temporal or spatial mean level shifts, missing values or extreme observations. A novel, robust approach based on Theil-type weighted regression is proposed for estimating self-similarity in two-dimensional data (images). The method is compared to two traditional estimation techniques that use wavelet decompositions; ordinary least squares (OLS) and Abry-Veitch bias correcting estimator (AV). As an application, the suitability of the self-similarity estimate resulting from the the robust approach is illustrated as a predictive feature in the classification of digitized mammogram images as cancerous or non-cancerous. The diagnostic employed here is based on the properties of image backgrounds, which is typically an unused modality in breast cancer screening. Classification results show nearly 68% accuracy, varying slightly with the choice of wavelet basis, and the range of multiresolution levels used.
    Fisher Task Distance and Its Applications in Neural Architecture Search and Transfer Learning. (arXiv:2103.12827v4 [cs.LG] UPDATED)
    (2 min) We formulate an asymmetric (or non-commutative) distance between tasks based on FisherInformation Matrices. We provide a proof of consistency for our distance through theorems and experiments on various classification tasks. We then apply our proposed measure of task distance in transfer learning on visual tasks in the Taskonomy dataset. Additionally, we show how the proposed distance between a target task and a set of baseline tasks can be used to reduce the neural architecture search space for the target task. The complexity reduction in search space for task-specific architectures is achieved by building on the optimized architectures for similar tasks instead of doing a full search and without using this side information. Experimental results demonstrate the efficacy of the proposed approach and its improvements over other methods.
    Joint Estimation and Inference for Data Integration Problems based on Multiple Multi-layered Gaussian Graphical Models. (arXiv:1803.03348v3 [stat.ML] UPDATED)
    (2 min) The rapid development of high-throughput technologies has enabled the generation of data from biological or disease processes that span multiple layers, like genomic, proteomic or metabolomic data, and further pertain to multiple sources, like disease subtypes or experimental conditions. In this work, we propose a general statistical framework based on Gaussian graphical models for horizontal (i.e. across conditions or subtypes) and vertical (i.e. across different layers containing data on molecular compartments) integration of information in such datasets. We start with decomposing the multi-layer problem into a series of two-layer problems. For each two-layer problem, we model the outcomes at a node in the lower layer as dependent on those of other nodes in that layer, as well as all nodes in the upper layer. We use a combination of neighborhood selection and group-penalized regression to obtain sparse estimates of all model parameters. Following this, we develop a debiasing technique and asymptotic distributions of inter-layer directed edge weights that utilize already computed neighborhood selection coefficients for nodes in the upper layer. Subsequently, we establish global and simultaneous testing procedures for these edge weights. Performance of the proposed methodology is evaluated on synthetic and real data.
    A Causal Lens for Controllable Text Generation. (arXiv:2201.09119v1 [cs.CL])
    (2 min) Controllable text generation concerns two fundamental tasks of wide applications, namely generating text of given attributes (i.e., attribute-conditional generation), and minimally editing existing text to possess desired attributes (i.e., text attribute transfer). Extensive prior work has largely studied the two problems separately, and developed different conditional models which, however, are prone to producing biased text (e.g., various gender stereotypes). This paper proposes to formulate controllable text generation from a principled causal perspective which models the two tasks with a unified framework. A direct advantage of the causal formulation is the use of rich causality tools to mitigate generation biases and improve control. We treat the two tasks as interventional and counterfactual causal inference based on a structural causal model, respectively. We then apply the framework to the challenging practical setting where confounding factors (that induce spurious correlations) are observable only on a small fraction of data. Experiments show significant superiority of the causal approach over previous conditional models for improved control accuracy and reduced bias.
    Nonconvex Stochastic Scaled-Gradient Descent and Generalized Eigenvector Problems. (arXiv:2112.14738v2 [stat.ML] UPDATED)
    (0 min) Motivated by the problem of online canonical correlation analysis, we propose the \emph{Stochastic Scaled-Gradient Descent} (SSGD) algorithm for minimizing the expectation of a stochastic function over a generic Riemannian manifold. SSGD generalizes the idea of projected stochastic gradient descent and allows the use of scaled stochastic gradients instead of stochastic gradients. In the special case of a spherical constraint, which arises in generalized eigenvector problems, we establish a nonasymptotic finite-sample bound of $\sqrt{1/T}$, and show that this rate is minimax optimal, up to a polylogarithmic factor of relevant parameters. On the asymptotic side, a novel trajectory-averaging argument allows us to achieve local asymptotic normality with a rate that matches that of Ruppert-Polyak-Juditsky averaging. We bring these ideas together in an application to online canonical correlation analysis, deriving, for the first time in the literature, an optimal one-time-scale algorithm with an explicit rate of local asymptotic convergence to normality. Numerical studies of canonical correlation analysis are also provided for synthetic data.
  • Google AI Blog

    Resolving High-Energy Impacts on Quantum Processors
    (7 min) Posted by Matt McEwen, Student Researcher, Google Quantum AI and Lara Faoro, Research Fellow, LPTHE- Sorbonne Université and CNRS (Paris) Quantum processors are made of superconducting quantum bits (qubits) that — being quantum objects — are highly susceptible to even tiny amounts of environmental noise. This noise can cause errors in quantum computation that need to be addressed to continue advancing quantum computers. Our Sycamore processors are installed in specially designed cryostats, where they are sealed away from stray light and electromagnetic fields and are cooled down to very low temperatures to reduce thermal noise. However, the world is full of high-energy radiation. In fact, there’s a tiny background of high-energy gamma rays and muons that pass through everything around u…
  • Data Science Central

    Cloud Data Platforms and the TMGT(Too-Much-of-a-Good-Thing) Effect
    (5 min) “The too-much-of-a-good-thing (TMGT) effect occurs when an initially positive relation between an antecedent and a desirable outcome variable turns negative when the underlying ordinarily beneficial antecedent is taken too far, such that the overall relation becomes nonmonotonic.”  The ABC for Studying the Too-Much-of-a-Good-Thing Effect: A Competitive Mediation Framework Linking Antecedents, Benefits, and Costs; Christian Busse,… Read More »Cloud Data Platforms and the TMGT(Too-Much-of-a-Good-Thing) Effect The post Cloud Data Platforms and the TMGT(Too-Much-of-a-Good-Thing) Effect appeared first on Data Science Central.
    DSC Weekly Digest for 25, 2022: They’re Coming To Take My Job Away!
    (7 min) Over the last thirty years, there have been two camps that have emerged (I’m not sure that calling them schools of thought is even warranted) taking one of the two positions: 1) The advent of robots and computers would take away my job, or 2) The computer and robotics age will herald the rise of… Read More »DSC Weekly Digest for 25, 2022: They’re Coming To Take My Job Away! The post DSC Weekly Digest for 25, 2022: They’re Coming To Take My Job Away! appeared first on Data Science Central.
  • The Official NVIDIA Blog

    Hatch Me If You Can: Startup’s Sorting Machines Use AI to Protect Healthy Fish Eggs
    (3 min) Fisheries collect millions upon millions of fish eggs, protecting them from predators to increase fish yield and support the propagation of endangered species — but an issue with gathering so many eggs at once is that those infected with parasites can put healthy ones at risk. Jensorter, an Oregon-based startup, has created AI-powered fish egg Read article > The post Hatch Me If You Can: Startup’s Sorting Machines Use AI to Protect Healthy Fish Eggs appeared first on The Official NVIDIA Blog.
    UK Biobank Advances Genomics Research with NVIDIA Clara Parabricks
    (4 min) UK Biobank is broadening scientists’ access to high-quality genomic data and analysis by making its massive dataset available in the cloud alongside NVIDIA GPU-accelerated analysis tools. Used by more than 25,000 registered researchers around the world, UK Biobank is a large-scale biomedical database and research resource with deidentified genetic datasets, along with medical imaging and Read article > The post UK Biobank Advances Genomics Research with NVIDIA Clara Parabricks appeared first on The Official NVIDIA Blog.
  • MIT News - Artificial intelligence

    Deploying machine learning to improve mental health
    (6 min) MIT scientist Rosalind Picard collaborates with clinicians to develop tools for mental health care delivery.
    Cynthia Breazeal named dean for digital learning at MIT
    (4 min) Overseeing business and research units across MIT Open Learning, Breazeal will focus on the future of digital technologies and their applications in education.
  • AWS Machine Learning Blog

    How Logz.io accelerates ML recommendations and anomaly detection solutions with Amazon SageMaker
    (11 min) Logz.io is an AWS Partner Network (APN) Advanced Technology Partner with AWS Competencies in DevOps, Security, and Data & Analytics. Logz.io offers a software as a service (SaaS) observability platform based on best-in-class open-source software solutions for log, metric, and tracing analytics. Customers are sending an increasing amount of data to Logz.io from various data […]
  • Becoming Human: Artificial Intelligence Magazine - Medium

    10 Best C++ Courses for Beginners and Experienced Developers
    (9 min) My favorite C++ online courses to learning C++ for beginners, game development, and experienced programmers from Udemy, Coursera…
  • John D. Cook

    A general theory of sub-things
    (3 min) When I took my first abstract algebra course, we had a homework question about subgroups. Someone in the class whined that the professor hadn’t told us yet what a subgroup was. My immediate thought was “I bet you could guess. Sub things are all the same. A sub-X is an X contained inside another X. […] A general theory of sub-things first appeared on John D. Cook.
    How to subscribe
    (1 min) I so despise pop-ups asking me to subscribe to a site’s newsletter that I’ve probably erred on the side of making it too hard to find out how to subscribe to this blog. At the bottom of the “writing” menu is a subscribe link. You can subscribe three different ways: Subscribe to posts via RSS […] How to subscribe first appeared on John D. Cook.
    Non-equivalent floating point numbers
    (2 min) Here’s an interesting paragraph from Handbook of Floating Point Arithmetic by Jean-Michel Muller et al. Many common transformations of a code into some naively equivalent code become impossible if one takes into account special haves such as NaNs, signed zeros, and rounding modes. For instance, the expression x + 0 is not equivalent to the expression x […] Non-equivalent floating point numbers first appeared on John D. Cook.
    Temporal and polymodal logics
    (2 min) My posts on modal logic have mostly been about monomodal logic, logic with one modal operator. This may not seem accurate because I’ve talked about □ (“box”) and ◇ (“diamond”). But these are really just one mode: you can define either in terms of the other. ◇p = ¬ □ ¬p □p = ¬ ◇ […] Temporal and polymodal logics first appeared on John D. Cook.
  • Artificial Intelligence

    Death Robot Spider Crabs (AI Video Art)
    (0 min) submitted by /u/glenniszen [link] [comments]
    InspiroBot interesting thoughts
    (0 min) submitted by /u/Significant_Ship_184 [link] [comments]
    OpenAI Releases Three Embedding Model Families To Optimize Text Search, Code Search and Text Similarity
    (1 min) In the last few decades, neural networks have been used for a wide range of tasks, including image segmentation, natural language processing, and time-series forecasting. One promising use of deep neural networks is embedding, a method for representing discrete variables as continuous vectors. An embedding is a low-dimensional space into which high-dimensional vectors can be translated, making it easy for computers to understand the relationships between those concepts. Numerically similar embeddings are also semantically identical. Word embeddings for machine translation and entity embeddings for categorical data are two applications of this approach. Continue Reading Paper: https://arxiv.org/abs/2201.10005 Documentation: https://beta.openai.com/docs/guides/embeddings submitted by /u/techsucker [link] [comments]
    How to edit videos with StyleGAN- Stitch it in Time: GAN-Based Facial Editing of Real Videos - 5-minute paper summary (by Casual GAN Papers)
    (1 min) What do you do after mastering image editing? One possible answer is to move on to video editing, a significantly more challenging task due to the inherent lack of temporal coherency in existing inversion and editing methods. Nevertheless, Rotem Tzaban and the team at The Blavatnik School of Computer Science and Tel Aviv University show that a StyleGAN is all you need. Well, a StyleGAN and several insightful tweaks to the frame-by-frame inversion and editing pipeline to obtain a method that produces temporally consistent high-quality edited videos, and yes, that includes CLIP-guided editing. With the overview part out of the way, let’s dive into the details. Full summary: https://t.me/casual_gan/245 Blog post: https://www.casualganpapers.com/hiqh_quality_video_editing_stylegan_inversion/Stitch-It-In-Time-explained.html Stitch it in Time arxiv / code Subscribe to Casual GAN Papers and follow me on Twitter for weekly AI paper summaries! submitted by /u/KirillTheMunchKing [link] [comments]
    Microsoft can Leverage AI in Healthcare Industry
    (0 min) submitted by /u/Beautiful-Credit-868 [link] [comments]
    What is DeepSpeed?
    (0 min) submitted by /u/Beautiful-Credit-868 [link] [comments]
    User Interface Opinion Survey
    (1 min) User Interface Opinion Survey ​ UX (User eXperience) mASI Client Study (Question Set 3) The survey will take approximately 10 minutes to complete. ​ In this question set you are a user of a software system that is helping to train and collect emotional data about something. In the questions you will be asked how you would perform a specified task with the associated screen. Answers do not need to be long but give enough detail of how you would expect that screen to operate. None of these screens will be what will necessarily be used. Your feedback is kept anonymous. Try to answer these with your first instinctual response. ​ There's no wrong answers here, or prior knowledge needed. ​ https://forms.office.com/r/NaGep4nK1v submitted by /u/jasonpGeringer [link] [comments]
    CVPR 2021 Best Paper Award: GIRAFFE - Controllable Image Generation
    (0 min) submitted by /u/OnlyProggingForFun [link] [comments]
    Friday coffee talk! Free webinar on Data Privacy and AI on January 28th, 2022. In this free webinar, you will get answers among others to how to build a culture of privacy in a growth environment or the impact, challenges, and importance of transparency. 🔐
    (1 min) submitted by /u/Agreeable-Ad-8307 [link] [comments]
    Artificial Nightmares : Godfrey Lord King of Lions || Pytti VQGAN AI Art Video [4K 60 FPS]
    (0 min) submitted by /u/Thenamessd [link] [comments]
    Chrome Browser Extension to get Research Paper Details
    (1 min) https://chrome.google.com/webstore/detail/paperslabmlai/agkeoekbbkpokfbpmmminnombikadbhb Identifies research papers mentioned in websites you visit, and the extension shows you the following details about research papers, Two-line summary. Availability of source code, videos, Reddit and Hacker News discussions. Popularity on Twitter. Conferences. Also, we released the source code of the extension, Github: https://github.com/labmlai/chrome-extension We love to hear your feedback and suggestions. Thank you all, and I appreciate the support. submitted by /u/hnipun [link] [comments]
    If I wanted to make a identity collecting AI how would I start?
    (1 min) Hi everyone, I’m wondering if anyone has a idea upon how I would go about creating a AI that attempts to find important data about individuals based on a name and other factors. THIS IS A PROOF OF CONCEPT !!! submitted by /u/Alext2v2 [link] [comments]
    Introduction To Federated Learning: Enabling The Scaling Of Machine Learning Across Decentralized Data Whilst Preserving Data Privacy
    (1 min) Large volumes of data are required for training machine learning models. The trained model is run on a cloud server that users can access through various applications such as web search, translation, text production, and picture processing, which is the standard procedure for establishing machine learning applications. The application must transfer the user’s data to the server where the machine learning model is stored every time it wishes to use it, creating privacy, security, and processing issues. Fortunately, developments in edge AI have allowed sensitive user data to be avoided from being sent to application servers. This current area of study, also known as TinyML, aims to construct machine learning models that fit smartphones and other consumer devices, making on-device inference possible. Even if the device is not connected to the internet, these applications can continue functioning. The on-device inference is more energy-efficient in many applications than transferring data to the cloud. However, the data is still required to train the models installed on customers’ devices. When the entity generating the models already owns the data (e.g., a bank owns its transactions) or the data is public information, this isn’t a problem (e.g., Wikipedia or news articles). But, acquiring training data for machine learning models that leverage confidential user information such as emails, chat logs, or personal images poses numerous obstacles. Read The Full Article: https://www.marktechpost.com/2022/01/25/introduction-to-federated-learning-enabling-the-scaling-of-machine-learning-across-decentralized-data-whilst-preserving-data-privacy/ ​ https://preview.redd.it/mhixr329qxd81.png?width=1536&format=png&auto=webp&s=d365cf34f7d630ae49bf8c7088d8342cfe0f1366 submitted by /u/techsucker [link] [comments]
    WebtoonMe project (video cartoonization)
    (0 min) I would like to share my interesting photo/video-cartoonization project. Please check out the cool demo videos :-) https://webtoon.github.io/WebtoonMe/en submitted by /u/jis478 [link] [comments]
  • Machine Learning

    [D] On initialization schemes for MLPs: practice and theory
    (1 min) In this post I am thinking only about MLPs with ReLU activation function. The default pytorch initialization for linear layers is from a uniform distribution centered at 0 whose limits values depends on the input dimension. Many papers assume initialization from a Gaussian distribution with 0 mean and provide a certain variance. There is also this work [Pennington (2017)] that proposes orthogonal initialization to achieve what they call dynamical isometry which means that the input-output Jacobian is 1 (or stays near), which in turns implies that gradients do not explore nor vanish. There is also this result [Hu (2020)] that shows with orthogonal initialization one can achieve linear convergence (convergence that decreases as k^t with K<1 and t being training step) for linear networks …
    Are you a Machine Learning Engineer interested in Supporting Ethics in your Everyday work? [R]
    (1 min) Hi everyone, I am a Researcher at UXP2 Lab Purdue University working on an NSF funded project to support practitioners to improve ethics in their everyday practice. If you are a Machine Learning Engineer who likes to think about ethics in your everyday work, I would love for you to attend one of our virtual workshops. This three-hour workshop will engage you in co-design activities with 3-5 other technology practitioners to build, iterate on, and implement ethics-focused action plans based on provided prompts and your personal experiences. A $50 incentive will be provided for your participation. Fill out the following form to express your interest in this study: https://forms.gle/Ac89zyJLdTrnaHVS9 We are looking for practitioners who are currently employed in roles that include (but are not limited to): Software Engineering, Data Science, Front/Back-end Development, Product Management, User Experience (UX), and other design or technology personnel responsible for the development of digital systems in any industry or governmental context. You can learn more about the overall grant project at https://everydayethics.uxp2.com. I am looking forward to seeing you at the virtual workshop. Thank you! If you have any questions, please feel free to message me! submitted by /u/rikeob [link] [comments]
    [D] Google FLoC and Topics API suspiciously similar.
    (2 min) If you work in ad tech in any machine-learning capacity you've probably been following the development of Google's Privacy sandbox, their Federated Learning of Cohorts, and today, their substitute Topics API. After significant backlash from the ad space, they effectively killed FLoC today and present an alternative to it. Common elements to both: In-browser. 2) In browser definition of topics. 3) Will still require some deployment of machine learning models in-browser to make it work. So, basically... the same? These are the core elements of FLoC, ad it seems of Topics API. ​ More elements: FLoC "The browser uses machine learning algorithms to develop a cohort based on the sites that an individual visits. The algorithms might be based on the URLs of the visited sites, on the content of those pages, or other factors. The central idea is that these input features to the algorithm, including the web history, are kept local on the browser and are not uploaded elsewhere — the browser only exposes the generated cohort." Source: https://github.com/WICG/floc ​ Topics API "With Topics, your browser determines a handful of topics, like “Fitness” or “Travel & Transportation,” that represent your top interests for that week based on your browsing history. Topics are kept for only three weeks and old topics are deleted. Topics are selected entirely on your device without involving any external servers, including Google servers." Source: https://blog.google/products/chrome/get-know-new-topics-api-privacy-sandbox/ "The topics are selected from an advertising taxonomy. (...) The topics will be inferred by the browser. The browser will leverage a classifier model to map site hostnames to topics." Source: https://github.com/jkarlin/topics. ​ Discuss. submitted by /u/xlr_ [link] [comments]
    [News] [ICLR 2022 Workshop] ML Evaluation Standards
    (1 min) The field of ML is undergoing massive growth, and it is becoming apparent that it may be in need of self-reflection to ensure that efforts are directed towards real progress. More recently, there is an increasing number of papers at top conferences on the topic of ML evaluation, which show evidence of non-reliable findings and unsupported empirical claims in several subfields including computer vision, recommender systems, reinforcement learning, natural language processing, hyperparameter optimization and more. Such papers highlight the need for more scientific rigour and careful evaluation, both by researchers themselves and the reviewers. Thus, it seems that these are discussions that researchers are interested in having, while it is yet unclear what is the best path forward. To this end, we are organizing a workshop on "Setting ML Evaluation Standards to Accelerate Progress" at ICLR 2022. We invite two types of papers – opinion papers (up to 4 pages) stating positions on the topics related to those listed above, and methodology papers (up to 8 pages excluding references) about evaluation in ML. These topics may include: Establishing benchmarking standards for ML research Reliable tools/protocols for benchmarking and evaluation Understanding and defining reproducibility for machine learning Meta analyses thoroughly evaluating existing claims across papers Incentives for doing better evaluation and reporting results Submission Site: cmt3.research.microsoft.com/SMILES2022. For more details, please see the call for papers at ml-eval.github.io/call-for-papers/. Submission deadline is March 4. For a summary of the workshop, see this twitter thread. submitted by /u/smallest_meta_review [link] [comments]
    [D] Anyone work at a bank? How do you document predictive models just in case they are audited?
    (4 min) Hey all, I work at a bank and am about to start building my first predictive model. I'm curious how you document your models in case auditors ask to see them? I'm also meeting with our internal auditors next week to come up with a plan, but I'd love to know what you do at your organization if you are willing to share. Any advice? submitted by /u/InjuryNeat7483 [link] [comments]
    [N] We did a wiki for ML experts/noobs with algotrading background
    (0 min) Hello guys, this weekend, we made a wiki with notion for Machine Learning enthusiasts in Algotrading. It scrapes data such as news, social media, github repos, research papers and other things. A feedback would be very kind on what we can improve :) https://mltraders.wiki/ submitted by /u/GarantBM [link] [comments]
    [P] Probabilistic Memory Retrieval
    (1 min) I got this idea after reading the blogpost about RETRO model which was written by Jay Alammar. The main idea is to combine the BERT model and the database into one probabilistic model. This probabilistic model approximates ("assumption") the database and helps to find the neighbouring latent vectors for the inputs. For the past few weeks, in my free time, I was trying to solve this problem. This article is just showing a proof of concept of my idea. In my opinion the idea require more refinement but due to lack of resources I might not able to proceed further. Article: https://gananath.github.io/prob_retrieval.html Github: https://github.com/Gananath/probablistic_memory_retrieval submitted by /u/gananath [link] [comments]
    [D] How to edit videos with StyleGAN- Stitch it in Time: GAN-Based Facial Editing of Real Videos - 5-minute paper summary (by Casual GAN Papers)
    (1 min) What do you do after mastering image editing? One possible answer is to move on to video editing, a significantly more challenging task due to the inherent lack of temporal coherency in existing inversion and editing methods. Nevertheless, Rotem Tzaban and the team at The Blavatnik School of Computer Science and Tel Aviv University show that a StyleGAN is all you need. Well, a StyleGAN and several insightful tweaks to the frame-by-frame inversion and editing pipeline to obtain a method that produces temporally consistent high-quality edited videos, and yes, that includes CLIP-guided editing. With the overview part out of the way, let’s dive into the details. Full summary: https://t.me/casual_gan/245 Blog post: https://www.casualganpapers.com/hiqh_quality_video_editing_stylegan_inversion/Stitch-It-In-Time-explained.html Stitch it in Time arxiv / code Subscribe to Casual GAN Papers and follow me on Twitter for weekly AI paper summaries! submitted by /u/KirillTheMunchKing [link] [comments]
    [D] Are black-box explainers (e.g. SHAP) just another black-box?
    (5 min) Tools such as SHAP are now very common to explain "black box" models such as GBMs or Neural Networks. However, people who use them often don't really understand how they work. And these tools are no less complex than the models they would like to explain. Aren't we just explaining "black box" models with black box explainers? [link] [comments]
    Question - best generative AI to match an illustration style? [P]
    (1 min) Hey all, I want to load in a series of illustrated images and the have the AI produce more image to match them same style. I've tried StyleGAN2 and the Looking Glass notebook. Are there other AI's that are good for this or for finetuning matching styles? Cheers submitted by /u/maaartiin_mac [link] [comments]
    [R] The Path to Machine Intelligence: Classic AI vs. Deep Learning vs. Biological Approach
    (1 min) In Numenta’s latest blog post, they define, compare and evaluate three different approaches to building smart machines: Classic AI, Deep Learning, and Biological Neural Networks. https://numenta.com/blog/2022/01/25/the-path-to-machine-intelligence submitted by /u/hardmaru [link] [comments]
    [R] Prospective Learning: Back to the Future
    (0 min) submitted by /u/hardmaru [link] [comments]
    [P] I built a robot to protect my birdfeeder from squirrels using computer vision
    (4 min) Last summer, I got real pissed off that squirrels kept eating the birdseed from my feeder, leaving none for the birds. So I did what any self-respecting ML dilettante would do: I built a robot to detect squirrels. When it sees a squirrel, the robot (really just a raspberry pi) sprays them with the hose. Yes, there's video: https://www.youtube.com/watch?v=gP8qG0vLN78&feature=emb_title Technically speaking, it's a Raspberry Pi with a camera that takes a picture every 30 seconds. Then it runs it through a classifier hosted on AWS Lambda I trained with Fast.ai. When the model thinks it sees a positive, the Pi applies current to a taped-up, incorrectly-threaded solenoid valve hooked up to the sprayer nozzle zip-tied to a tomato cage. It also records a 7 second video of the panicked, wet squirrel, for the lulz. As with any model, distribution shifts are a major challenge... but here, that just means "mourning doves show up for the first time, the model sees a big gray thing on the feeder and says, 'Oh! A squirrel!' and sprays it." But it works! The rate of squirrels at my feeder declined over time once I turned on the Squirrel Soaker 9000. (The rate declined even more once I bought a squirrel baffle. 🙃 ) Blogpost: https://jeremybmerrill.com/blog/2022/01/squirrel-soaker-9000-repelling-squirrels-with-ai.html And don't worry, I have seen that Mark Roper video. submitted by /u/jeremybmerrill [link] [comments]
  • Neural Networks, Deep Learning and Machine Learning

    CNN for detection and tracking of players in sports (e.g. basketball, soccer,..)?
    (1 min) I am a newbie and I wonder if CNNs would be the right choice to detect and track the running path of players in sports videos. I got couple of questions regarding that, maybe someone more experienced can help me. Are CNNs still the state of the art for image recognition? Are there any (preferably simple) NNs used for object tracking, that I could use as a starting point? Sorry for these noob questions. submitted by /u/Supermarket3000 [link] [comments]
    On the Difficulty of Extrapolation with NN Scaling
    (0 min) submitted by /u/nickb [link] [comments]
  • Reinforcement Learning

    SAC: Enforcing Action Bounds formula derivation
    (1 min) In the SAC paper https://arxiv.org/pdf/1812.05905.pdf there is a chapter `C Enforcing Action Bounds`. I struggle to understand how they got that formula `π(a|s) = µ(u|s) det|da/du|^-1`. Can anyone help me with that? submitted by /u/CharlottHebb [link] [comments]
    Average reward algorithms when reward distribution changes
    (2 min) Assume my RL agent has to manage the load balancing of a series of servers. This problem fits very well in the average reward formulation, since we do not have episodes, but only an infinite-length task where we want to optimize the average throughput and minimize the average delay. Now, assume that the traffic on servers is high during the day and low during the night. Therefore, the average reward that the agent can achieve will depend on the time of the day. Over a long period of time, the average reward R that we subtract in avg. reward Bellman equations would be an average over both day and night. But this way, when the agent operates during the night, the reward it will get will look higher than average, while during the day, when the traffic is high, the reward will look less than average. How can avg. reward algorithms deal with this? Does the average reward R need to be taken over a smaller window of the last rewards? Or using an exponential average with an exp parameter that is sensitive enough to these intra-day variations in the average reward? Or does the reward function need to be invariant to the time of the day? This might be difficult to achieve though submitted by /u/fedetask [link] [comments]
    Combining reward functions with different scales and meaning
    (3 min) What are the best practices to use a reward function that is a combination of several types of rewards, that can have very different scales and meanings? Take the example of a robot serving customers at a restaurant. I want it to maximize the number of dishes it serves during the day, but I also want to penalize it for making customers wait more than a certain amount of time. Note that without the penalization the robot might decide to never serve a customer that requested a complicated dish because serving only customers with simple dishes is more rewarding. The reward would be something like r = w_1 * r_1 + w_2 * r_2, where r_1 is +1 for each served customer and r_2 is -wait_time of customers waiting more than a threshold. w_1 and w_2 are weights to trade off this behavior. More generally, I can have a reward function made of several components like that. One idea is to scale each reward by a constant so that all rewards are in [-1, 1] and I can effectively use the weights for the trade-off. However, this is not possible for many reward functions (e.g. waiting time that can be infinite), and even when it's possible, it's tricky to find the right constant that maintains the rewards around the desired value but doesn't under- or over-scale them. Another idea is to keep a window of the last N rewards of each component of the reward function, and then transform each component by subtracting the mean and dividing by stdev. Do you think this option is viable? What effect would it have on the overall value function? Do you know any other technique? submitted by /u/fedetask [link] [comments]
    "Evolution Gym: A Large-Scale Benchmark for Evolving Soft Robots", Bhatia et al 2022
    (1 min) submitted by /u/gwern [link] [comments]
    "Environment Generation for Zero-Shot Compositional Reinforcement Learning", Gur et al 2022
    (0 min) submitted by /u/gwern [link] [comments]

2022-01-25

  • cs.LG updates on arXiv.org

    You Never Cluster Alone. (arXiv:2106.01908v3 [cs.CV] UPDATED)
    (2 min) Recent advances in self-supervised learning with instance-level contrastive objectives facilitate unsupervised clustering. However, a standalone datum is not perceiving the context of the holistic cluster, and may undergo sub-optimal assignment. In this paper, we extend the mainstream contrastive learning paradigm to a cluster-level scheme, where all the data subjected to the same cluster contribute to a unified representation that encodes the context of each data group. Contrastive learning with this representation then rewards the assignment of each datum. To implement this vision, we propose twin-contrast clustering (TCC). We define a set of categorical variables as clustering assignment confidence, which links the instance-level learning track with the cluster-level one. On one hand, with the corresponding assignment variables being the weight, a weighted aggregation along the data points implements the set representation of a cluster. We further propose heuristic cluster augmentation equivalents to enable cluster-level contrastive learning. On the other hand, we derive the evidence lower-bound of the instance-level contrastive objective with the assignments. By reparametrizing the assignment variables, TCC is trained end-to-end, requiring no alternating steps. Extensive experiments show that TCC outperforms the state-of-the-art on challenging benchmarks.
    A generalization of the randomized singular value decomposition. (arXiv:2105.13052v3 [math.NA] UPDATED)
    (2 min) The randomized singular value decomposition (SVD) is a popular and effective algorithm for computing a near-best rank $k$ approximation of a matrix $A$ using matrix-vector products with standard Gaussian vectors. Here, we generalize the randomized SVD to multivariate Gaussian vectors, allowing one to incorporate prior knowledge of $A$ into the algorithm. This enables us to explore the continuous analogue of the randomized SVD for Hilbert--Schmidt (HS) operators using operator-function products with functions drawn from a Gaussian process (GP). We then construct a new covariance kernel for GPs, based on weighted Jacobi polynomials, which allows us to rapidly sample the GP and control the smoothness of the randomly generated functions. Numerical examples on matrices and HS operators demonstrate the applicability of the algorithm.
    Representing Long-Range Context for Graph Neural Networks with Global Attention. (arXiv:2201.08821v1 [cs.LG])
    (2 min) Graph neural networks are powerful architectures for structured datasets. However, current methods struggle to represent long-range dependencies. Scaling the depth or width of GNNs is insufficient to broaden receptive fields as larger GNNs encounter optimization instabilities such as vanishing gradients and representation oversmoothing, while pooling-based approaches have yet to become as universally useful as in computer vision. In this work, we propose the use of Transformer-based self-attention to learn long-range pairwise relationships, with a novel "readout" mechanism to obtain a global graph embedding. Inspired by recent computer vision results that find position-invariant attention performant in learning long-range relationships, our method, which we call GraphTrans, applies a permutation-invariant Transformer module after a standard GNN module. This simple architecture leads to state-of-the-art results on several graph classification tasks, outperforming methods that explicitly encode graph structure. Our results suggest that purely-learning-based approaches without graph structure may be suitable for learning high-level, long-range relationships on graphs. Code for GraphTrans is available at https://github.com/ucbrise/graphtrans.
    MaGNET: Uniform Sampling from Deep Generative Network Manifolds Without Retraining. (arXiv:2110.08009v3 [cs.LG] UPDATED)
    (2 min) Deep Generative Networks (DGNs) are extensively employed in Generative Adversarial Networks (GANs), Variational Autoencoders (VAEs), and their variants to approximate the data manifold and distribution. However, training samples are often distributed in a non-uniform fashion on the manifold, due to costs or convenience of collection. For example, the CelebA dataset contains a large fraction of smiling faces. These inconsistencies will be reproduced when sampling from the trained DGN, which is not always preferred, e.g., for fairness or data augmentation. In response, we develop MaGNET, a novel and theoretically motivated latent space sampler for any pre-trained DGN, that produces samples uniformly distributed on the learned manifold. We perform a range of experiments on various datasets and DGNs, e.g., for the state-of-the-art StyleGAN2 trained on FFHQ dataset, uniform sampling via MaGNET increases distribution precision and recall by 4.1\% \& 3.0\% and decreases gender bias by 41.2\%, without requiring labels or retraining. As uniform distribution does not imply uniform semantic distribution, we also explore separately how semantic attributes of generated samples vary under MaGNET sampling.
    Learning Body-Aware 3D Shape Generative Models. (arXiv:2112.07022v3 [cs.GR] UPDATED)
    (2 min) The shape of many objects in the built environment is dictated by their relationships to the human body: how will a person interact with this object? Existing data-driven generative models of 3D shapes produce plausible objects but do not reason about the relationship of those objects to the human body. In this paper, we learn body-aware generative models of 3D shapes. Specifically, we train generative models of chairs, an ubiquitous shape category, which can be conditioned on a given body shape or sitting pose. The body-shape-conditioned models produce chairs which will be comfortable for a person with the given body shape; the pose-conditioned models produce chairs which accommodate the given sitting pose. To train these models, we define a "sitting pose matching" metric and a novel "sitting comfort" metric. Calculating these metrics requires an expensive optimization to sit the body into the chair, which is too slow to be used as a loss function for training a generative model. Thus, we train neural networks to efficiently approximate these metrics. We use our approach to train three body-aware generative shape models: a structured part-based generator, a point cloud generator, and an implicit surface generator. In all cases, our approach produces models which adapt their output chair shapes to input human body specifications.
    Constrained Policy Gradient Method for Safe and Fast Reinforcement Learning: a Neural Tangent Kernel Based Approach. (arXiv:2107.09139v2 [cs.LG] UPDATED)
    (2 min) This paper presents a constrained policy gradient algorithm. We introduce constraints for safe learning with the following steps. First, learning is slowed down (lazy learning) so that the episodic policy change can be computed with the help of the policy gradient theorem and the neural tangent kernel. Then, this enables us the evaluation of the policy at arbitrary states too. In the same spirit, learning can be guided, ensuring safety via augmenting episode batches with states where the desired action probabilities are prescribed. Finally, exogenous discounted sum of future rewards (returns) can be computed at these specific state-action pairs such that the policy network satisfies constraints. Computing the returns is based on solving a system of linear equations (equality constraints) or a constrained quadratic program (inequality constraints, regional constraints). Simulation results suggest that adding constraints (external information) to the learning can improve learning in terms of speed and transparency reasonably if constraints are appropriately selected. The efficiency of the constrained learning was demonstrated with a shallow and wide ReLU network in the Cartpole and Lunar Lander OpenAI gym environments. The main novelty of the paper is giving a practical use of the neural tangent kernel in reinforcement learning.
    Dictionary-based Low-Rank Approximations and the Mixed Sparse Coding problem. (arXiv:2111.12399v2 [cs.LG] UPDATED)
    (2 min) Constrained tensor and matrix factorization models allow to extract interpretable patterns from multiway data. Therefore identifiability properties and efficient algorithms for constrained low-rank approximations are nowadays important research topics. This work deals with columns of factor matrices of a low-rank approximation being sparse in a known and possibly overcomplete basis, a model coined as Dictionary-based Low-Rank Approximation (DLRA). While earlier contributions focused on finding factor columns inside a dictionary of candidate columns, i.e. one-sparse approximations, this work is the first to tackle DLRA with sparsity larger than one. I propose to focus on the sparse-coding subproblem coined Mixed Sparse-Coding (MSC) that emerges when solving DLRA with an alternating optimization strategy. Several algorithms based on sparse-coding heuristics (greedy methods, convex relaxations) are provided to solve MSC. The performance of these heuristics is evaluated on simulated data. Then, I show how to adapt an efficient MSC solver based on the LASSO to compute Dictionary-based Matrix Factorization and Canonical Polyadic Decomposition in the context of hyperspectral image processing and chemometrics. These experiments suggest that DLRA extends the modeling capabilities of low-rank approximations, helps reducing estimation variance and enhances the identifiability and interpretability of estimated factors.
    Evaluation of Machine Learning Techniques for Forecast Uncertainty Quantification. (arXiv:2111.14844v4 [cs.LG] UPDATED)
    (2 min) Producing an accurate weather forecast and a reliable quantification of its uncertainty is an open scientific challenge. Ensemble forecasting is, so far, the most successful approach to produce relevant forecasts along with an estimation of their uncertainty. The main limitations of ensemble forecasting are the high computational cost and the difficulty to capture and quantify different sources of uncertainty, particularly those associated with model errors. In this work proof-of-concept model experiments are conducted to examine the performance of ANNs trained to predict a corrected state of the system and the state uncertainty using only a single deterministic forecast as input. We compare different training strategies: one based on a direct training using the mean and spread of an ensemble forecast as target, the other ones rely on an indirect training strategy using a deterministic forecast as target in which the uncertainty is implicitly learned from the data. For the last approach two alternative loss functions are proposed and evaluated, one based on the data observation likelihood and the other one based on a local estimation of the error. The performance of the networks is examined at different lead times and in scenarios with and without model errors. Experiments using the Lorenz'96 model show that the ANNs are able to emulate some of the properties of ensemble forecasts like the filtering of the most unpredictable modes and a state-dependent quantification of the forecast uncertainty. Moreover, ANNs provide a reliable estimation of the forecast uncertainty in the presence of model error.
    Non-negative matrix factorization algorithms greatly improve topic model fits. (arXiv:2105.13440v2 [stat.ML] UPDATED)
    (2 min) We report on the potential for using algorithms for non-negative matrix factorization (NMF) to improve parameter estimation in topic models. While several papers have studied connections between NMF and topic models, none have suggested leveraging these connections to develop new algorithms for fitting topic models. NMF avoids the "sum-to-one" constraints on the topic model parameters, resulting in an optimization problem with simpler structure and more efficient computations. Building on recent advances in optimization algorithms for NMF, we show that first solving the NMF problem then recovering the topic model fit can produce remarkably better fits, and in less time, than standard algorithms for topic models. While we focus primarily on maximum likelihood estimation, we show that this approach also has the potential to improve variational inference for topic models. Our methods are implemented in the R package fastTopics.
    Lipschitz Bandits with Batched Feedback. (arXiv:2110.09722v3 [cs.LG] UPDATED)
    (2 min) In this paper, we study Lipschitz bandit problems with batched feedback, where the expected reward is Lipschitz and the reward observations are communicated to the player in batches. We introduce a novel landscape-aware algorithm, called Batched Lipschitz Narrowing (BLiN), that optimally solves this problem. Specifically, we show that for a $T$-step problem with Lipschitz reward of zooming dimension $d_z$, our algorithm achieves theoretically optimal regret rate of $ \widetilde{\mathcal{O}} \left( T^{\frac{d_z + 1}{d_z + 2}} \right) $ using only $ \mathcal{O} \left( \log\log T\right) $ batches. We also provide complexity analysis for this problem. Our theoretical lower bound implies that $\widetilde{\Omega}(\log\log T)$ batches are necessary for any algorithm to achieve the optimal regret. Thus, up to logarithmic factors, BLiN achieves optimal regret rate using minimal communication.
    Robust and Fully-Dynamic Coreset for Continuous-and-Bounded Learning (With Outliers) Problems. (arXiv:2107.00068v2 [cs.LG] UPDATED)
    (2 min) In many machine learning tasks, a common approach for dealing with large-scale data is to build a small summary, {\em e.g.,} coreset, that can efficiently represent the original input. However, real-world datasets usually contain outliers and most existing coreset construction methods are not resilient against outliers (in particular, an outlier can be located arbitrarily in the space by an adversarial attacker). In this paper, we propose a novel robust coreset method for the {\em continuous-and-bounded learning} problems (with outliers) which includes a broad range of popular optimization objectives in machine learning, {\em e.g.,} logistic regression and $ k $-means clustering. Moreover, our robust coreset can be efficiently maintained in fully-dynamic environment. To the best of our knowledge, this is the first robust and fully-dynamic coreset construction method for these optimization problems. Another highlight is that our coreset size can depend on the doubling dimension of the parameter space, rather than the VC dimension of the objective function which could be very large or even challenging to compute. Finally, we conduct the experiments on real-world datasets to evaluate the effectiveness of our proposed robust coreset method.
    Contrastive Learning with Stronger Augmentations. (arXiv:2104.07713v2 [cs.CV] UPDATED)
    (2 min) Representation learning has significantly been developed with the advance of contrastive learning methods. Most of those methods have benefited from various data augmentations that are carefully designated to maintain their identities so that the images transformed from the same instance can still be retrieved. However, those carefully designed transformations limited us to further explore the novel patterns exposed by other transformations. Meanwhile, as found in our experiments, the strong augmentations distorted the images' structures, resulting in difficult retrieval. Thus, we propose a general framework called Contrastive Learning with Stronger Augmentations~(CLSA) to complement current contrastive learning approaches. Here, the distribution divergence between the weakly and strongly augmented images over the representation bank is adopted to supervise the retrieval of strongly augmented queries from a pool of instances. Experiments on the ImageNet dataset and downstream datasets showed the information from the strongly augmented images can significantly boost the performance. For example, CLSA achieves top-1 accuracy of 76.2% on ImageNet with a standard ResNet-50 architecture with a single-layer classifier fine-tuned, which is almost the same level as 76.5% of supervised results. The code and pre-trained models are available in https://github.com/maple-research-lab/CLSA.
    Meta Learning MDPs with Linear Transition Models. (arXiv:2201.08732v1 [cs.LG])
    (2 min) We study meta-learning in Markov Decision Processes (MDP) with linear transition models in the undiscounted episodic setting. Under a task sharedness metric based on model proximity we study task families characterized by a distribution over models specified by a bias term and a variance component. We then propose BUC-MatrixRL, a version of the UC-Matrix RL algorithm, and show it can meaningfully leverage a set of sampled training tasks to quickly solve a test task sampled from the same task distribution by learning an estimator of the bias parameter of the task distribution. The analysis leverages and extends results in the learning to learn linear regression and linear bandit setting to the more general case of MDP's with linear transition models. We prove that compared to learning the tasks in isolation, BUC-Matrix RL provides significant improvements in the transfer regret for high bias low variance task distributions.
    Particle Cloud Generation with Message Passing Generative Adversarial Networks. (arXiv:2106.11535v3 [cs.LG] UPDATED)
    (2 min) In high energy physics (HEP), jets are collections of correlated particles produced ubiquitously in particle collisions such as those at the CERN Large Hadron Collider (LHC). Machine learning (ML)-based generative models, such as generative adversarial networks (GANs), have the potential to significantly accelerate LHC jet simulations. However, despite jets having a natural representation as a set of particles in momentum-space, a.k.a. a particle cloud, there exist no generative models applied to such a dataset. In this work, we introduce a new particle cloud dataset (JetNet), and apply to it existing point cloud GANs. Results are evaluated using (1) 1-Wasserstein distances between high- and low-level feature distributions, (2) a newly developed Fr\'{e}chet ParticleNet Distance, and (3) the coverage and (4) minimum matching distance metrics. Existing GANs are found to be inadequate for physics applications, hence we develop a new message passing GAN (MPGAN), which outperforms existing point cloud GANs on virtually every metric and shows promise for use in HEP. We propose JetNet as a novel point-cloud-style dataset for the ML community to experiment with, and set MPGAN as a benchmark to improve upon for future generative models. Additionally, to facilitate research and improve accessibility and reproducibility in this area, we release the open-source JetNet Python package with interfaces for particle cloud datasets, implementations for evaluation and loss metrics, and more tools for ML in HEP development.
    GAP-Gen: Guided Automatic Python Code Generation. (arXiv:2201.08810v1 [cs.PL])
    (2 min) Automatic code generation from natural language descriptions can be highly beneficial during the process of software development. In this work, we propose GAP-Gen, an automatic code generation method guided by Python syntactic constraints and semantic constraints. We first introduce Python syntactic constraints in the form of Syntax-Flow, which is a simplified version of Abstract Syntax Tree (AST) reducing the size and high complexity of Abstract Syntax Tree but maintaining the crucial syn-tactic information of Python code. In addition to Syntax-Flow, we introduce Variable-Flow which abstracts variable and function names consistently throughout the code. In our work, rather than pre-training, we focus on modifying the fine-tuning process which reduces computational requirements but retains high generation performance on automatic Python code generation task. GAP-Gen fine-tunes the transformer-based language models T5 and CodeT5 using the Code-to-Docstring datasets CodeSearchNet, CodeSearchNet AdvTest, and Code-Docstring-Corpus from EdinburghNLP. Our experiments show that GAP-Gen achieves better results on automatic Python code generation task than previous works
    Learning Two-Step Hybrid Policy for Graph-Based Interpretable Reinforcement Learning. (arXiv:2201.08520v1 [cs.LG])
    (2 min) We present a two-step hybrid reinforcement learning (RL) policy that is designed to generate interpretable and robust hierarchical policies on the RL problem with graph-based input. Unlike prior deep reinforcement learning policies parameterized by an end-to-end black-box graph neural network, our approach disentangles the decision-making process into two steps. The first step is a simplified classification problem that maps the graph input to an action group where all actions share a similar semantic meaning. The second step implements a sophisticated rule-miner that conducts explicit one-hop reasoning over the graph and identifies decisive edges in the graph input without the necessity of heavy domain knowledge. This two-step hybrid policy presents human-friendly interpretations and achieves better performance in terms of generalization and robustness. Extensive experimental studies on four levels of complex text-based games have demonstrated the superiority of the proposed method compared to the state-of-the-art.
    BEVT: BERT Pretraining of Video Transformers. (arXiv:2112.01529v2 [cs.CV] UPDATED)
    (2 min) This paper studies the BERT pretraining of video transformers. It is a straightforward but worth-studying extension given the recent success from BERT pretraining of image transformers. We introduce BEVT which decouples video representation learning into spatial representation learning and temporal dynamics learning. In particular, BEVT first performs masked image modeling on image data, and then conducts masked image modeling jointly with masked video modeling on video data. This design is motivated by two observations: 1) transformers learned on image datasets provide decent spatial priors that can ease the learning of video transformers, which are often times computationally-intensive if trained from scratch; 2) discriminative clues, i.e., spatial and temporal information, needed to make correct predictions vary among different videos due to large intra-class and inter-class variations. We conduct extensive experiments on three challenging video benchmarks where BEVT achieves very promising results. On Kinetics 400, for which recognition mostly relies on discriminative spatial representations, BEVT achieves comparable results to strong supervised baselines. On Something-Something-V2 and Diving 48, which contain videos relying on temporal dynamics, BEVT outperforms by clear margins all alternative baselines and achieves state-of-the-art performance with a 71.4% and 87.2% Top-1 accuracy respectively.
    Nash Convergence of Mean-Based Learning Algorithms in First Price Auctions. (arXiv:2110.03906v2 [cs.GT] UPDATED)
    (2 min) Understanding the convergence properties of learning dynamics in repeated auctions is a timely and important question in the area of learning in auctions, with numerous applications in, e.g., online advertising markets. This work focuses on repeated first price auctions where bidders with fixed values for the item learn to bid using mean-based algorithms -- a large class of online learning algorithms that include popular no-regret algorithms such as Multiplicative Weights Update and Follow the Perturbed Leader. We completely characterize the learning dynamics of mean-based algorithms, in terms of convergence to a Nash equilibrium of the auction, in two senses: (1) time-average: the fraction of rounds where bidders play a Nash equilibrium approaches 1 in the limit; (2)last-iterate: the mixed strategy profile of bidders approaches a Nash equilibrium in the limit. Specifically, the results depend on the number of bidders with the highest value: - If the number is at least three, the bidding dynamics almost surely converges to a Nash equilibrium of the auction, both in time-average and in last-iterate. - If the number is two, the bidding dynamics almost surely converges to a Nash equilibrium in time-average but not necessarily in last-iterate. - If the number is one, the bidding dynamics may not converge to a Nash equilibrium in time-average nor in last-iterate. Our discovery opens up new possibilities in the study of convergence dynamics of learning algorithms.
    Preventing Value Function Collapse in Ensemble {Q}-Learning by Maximizing Representation Diversity. (arXiv:2006.13823v3 [cs.LG] UPDATED)
    (2 min) The classic DQN algorithm is limited by the overestimation bias of the learned Q-function. Subsequent algorithms have proposed techniques to reduce this problem, without fully eliminating it. Recently, the Maxmin and Ensemble Q-learning algorithms have used different estimates provided by the ensembles of learners to reduce the overestimation bias. Unfortunately, these learners can converge to the same point in the parametric or representation space, falling back to the classic single neural network DQN. In this paper, we describe a regularization technique to maximize ensemble diversity in these algorithms. We propose and compare five regularization functions inspired from economics theory and consensus optimization. We show that the regularized approach significantly outperforms the Maxmin and Ensemble Q-learning algorithms as well as non-ensemble baselines.
    APack: Off-Chip, Lossless Data Compression for Efficient Deep Learning Inference. (arXiv:2201.08830v1 [cs.AR])
    (2 min) Data accesses between on- and off-chip memories account for a large fraction of overall energy consumption during inference with deep learning networks. We present APack, a simple and effective, lossless, off-chip memory compression technique for fixed-point quantized models. APack reduces data widths by exploiting the non-uniform value distribution in deep learning applications. APack can be used to increase the effective memory capacity, to reduce off-chip traffic, and/or to achieve the desired performance/energy targets while using smaller off-chip memories. APack builds upon arithmetic coding, encoding each value as an arithmetically coded variable length prefix, plus an offset. To maximize compression ratio a heuristic software algorithm partitions the value space into groups each sharing a common prefix. APack exploits memory access parallelism by using several, pipelined encoder/decoder units in parallel and keeps up with the high data bandwidth demands of deep learning. APack can be used with any machine learning accelerator. In the demonstrated configuration, APack is placed just before the off-chip memory controller so that he rest of the on-chip memory and compute units thus see the original data stream. We implemented the APack compressor and decompressor in Verilog and in a 65nm tech node demonstrating its performance and energy efficiency. Indicatively, APack reduces data footprint of weights and activations to 60% and 48% respectively on average over a wide set of 8-bit quantized models. It naturally adapts and compresses models that use even more aggressive quantization methods. When integrated with a Tensorcore-based accelerator, APack boosts the speedup and energy efficiency to 1.44X and 1.37X respectively.
    Deconfounding to Explanation Evaluation in Graph Neural Networks. (arXiv:2201.08802v1 [cs.LG])
    (2 min) Explainability of graph neural networks (GNNs) aims to answer ``Why the GNN made a certain prediction?'', which is crucial to interpret the model prediction. The feature attribution framework distributes a GNN's prediction to its input features (e.g., edges), identifying an influential subgraph as the explanation. When evaluating the explanation (i.e., subgraph importance), a standard way is to audit the model prediction based on the subgraph solely. However, we argue that a distribution shift exists between the full graph and the subgraph, causing the out-of-distribution problem. Furthermore, with an in-depth causal analysis, we find the OOD effect acts as the confounder, which brings spurious associations between the subgraph importance and model prediction, making the evaluation less reliable. In this work, we propose Deconfounded Subgraph Evaluation (DSE) which assesses the causal effect of an explanatory subgraph on the model prediction. While the distribution shift is generally intractable, we employ the front-door adjustment and introduce a surrogate variable of the subgraphs. Specifically, we devise a generative model to generate the plausible surrogates that conform to the data distribution, thus approaching the unbiased estimation of subgraph importance. Empirical results demonstrate the effectiveness of DSE in terms of explanation fidelity.
    Gradients are Not All You Need. (arXiv:2111.05803v2 [cs.LG] UPDATED)
    (2 min) Differentiable programming techniques are widely used in the community and are responsible for the machine learning renaissance of the past several decades. While these methods are powerful, they have limits. In this short report, we discuss a common chaos based failure mode which appears in a variety of differentiable circumstances, ranging from recurrent neural networks and numerical physics simulation to training learned optimizers. We trace this failure to the spectrum of the Jacobian of the system under study, and provide criteria for when a practitioner might expect this failure to spoil their differentiation based optimization algorithms.
    Unity Smoothing for Handling Inconsistent Evidence in Bayesian Networks and Unity Propagation for Faster Inference. (arXiv:2201.08659v1 [cs.LG])
    (2 min) We propose Unity Smoothing (US) for handling inconsistencies between a Bayesian network model and new unseen observations. We show that prediction accuracy, using the junction tree algorithm with US is comparable to that of Laplace smoothing. Moreover, in applications were sparsity of the data structures is utilized, US outperforms Laplace smoothing in terms of memory usage. Furthermore, we detail how to avoid redundant calculations that must otherwise be performed during the message passing scheme in the junction tree algorithm which we refer to as Unity Propagation (UP). Experimental results shows that it is always faster to exploit UP on top of the Lauritzen-Spigelhalter message passing scheme for the junction tree algorithm.
    A survey on datasets for fairness-aware machine learning. (arXiv:2110.00530v3 [cs.LG] UPDATED)
    (2 min) As decision-making increasingly relies on Machine Learning (ML) and (big) data, the issue of fairness in data-driven Artificial Intelligence (AI) systems is receiving increasing attention from both research and industry. A large variety of fairness-aware machine learning solutions have been proposed which involve fairness-related interventions in the data, learning algorithms and/or model outputs. However, a vital part of proposing new approaches is evaluating them empirically on benchmark datasets that represent realistic and diverse settings. Therefore, in this paper, we overview real-world datasets used for fairness-aware machine learning. We focus on tabular data as the most common data representation for fairness-aware machine learning. We start our analysis by identifying relationships between the different attributes, particularly w.r.t. protected attributes and class attribute, using a Bayesian network. For a deeper understanding of bias in the datasets, we investigate the interesting relationships using exploratory analysis.
    Learning elliptic partial differential equations with randomized linear algebra. (arXiv:2102.00491v2 [math.NA] UPDATED)
    (2 min) Given input-output pairs of an elliptic partial differential equation (PDE) in three dimensions, we derive the first theoretically-rigorous scheme for learning the associated Green's function $G$. By exploiting the hierarchical low-rank structure of $G$, we show that one can construct an approximant to $G$ that converges almost surely and achieves a relative error of $\mathcal{O}(\Gamma_\epsilon^{-1/2}\log^3(1/\epsilon)\epsilon)$ using at most $\mathcal{O}(\epsilon^{-6}\log^4(1/\epsilon))$ input-output training pairs with high probability, for any $0<\epsilon<1$. The quantity $0<\Gamma_\epsilon\leq 1$ characterizes the quality of the training dataset. Along the way, we extend the randomized singular value decomposition algorithm for learning matrices to Hilbert--Schmidt operators and characterize the quality of covariance kernels for PDE learning.
    Active Predictive Coding Networks: A Neural Solution to the Problem of Learning Reference Frames and Part-Whole Hierarchies. (arXiv:2201.08813v1 [cs.CV])
    (2 min) We introduce Active Predictive Coding Networks (APCNs), a new class of neural networks that solve a major problem posed by Hinton and others in the fields of artificial intelligence and brain modeling: how can neural networks learn intrinsic reference frames for objects and parse visual scenes into part-whole hierarchies by dynamically allocating nodes in a parse tree? APCNs address this problem by using a novel combination of ideas: (1) hypernetworks are used for dynamically generating recurrent neural networks that predict parts and their locations within intrinsic reference frames conditioned on higher object-level embedding vectors, and (2) reinforcement learning is used in conjunction with backpropagation for end-to-end learning of model parameters. The APCN architecture lends itself naturally to multi-level hierarchical learning and is closely related to predictive coding models of cortical function. Using the MNIST, Fashion-MNIST and Omniglot datasets, we demonstrate that APCNs can (a) learn to parse images into part-whole hierarchies, (b) learn compositional representations, and (c) transfer their knowledge to unseen classes of objects. With their ability to dynamically generate parse trees with part locations for objects, APCNs offer a new framework for explainable AI that leverages advances in deep learning while retaining interpretability and compositionality.
    Impacts of Students Academic Performance Trajectories on Final Academic Success. (arXiv:2201.08744v1 [cs.CY])
    (2 min) Many studies in the field of education analytics have identified student grade point averages (GPA) as an important indicator and predictor of students' final academic outcomes (graduate or halt). And while semester-to-semester fluctuations in GPA are considered normal, significant changes in academic performance may warrant more thorough investigation and consideration, particularly with regards to final academic outcomes. However, such an approach is challenging due to the difficulties of representing complex academic trajectories over an academic career. In this study, we apply a Hidden Markov Model (HMM) to provide a standard and intuitive classification over students' academic-performance levels, which leads to a compact representation of academic-performance trajectories. Next, we explore the relationship between different academic-performance trajectories and their correspondence to final academic success. Based on student transcript data from University of Central Florida, our proposed HMM is trained using sequences of students' course grades for each semester. Through the HMM, our analysis follows the expected finding that higher academic performance levels correlate with lower halt rates. However, in this paper, we identify that there exist many scenarios in which both improving or worsening academic-performance trajectories actually correlate to higher graduation rates. This counter-intuitive finding is made possible through the proposed and developed HMM model.
    Individual dynamic prediction of clinical endpoint from large dimensional longitudinal biomarker history: a landmark approach. (arXiv:2102.01466v2 [stat.ML] UPDATED)
    (3 min) The individual data collected throughout patient follow-up constitute crucial information for assessing the risk of a clinical event, and eventually for adapting a therapeutic strategy. Joint models and landmark models have been proposed to compute individual dynamic predictions from repeated measures to one or two markers. However, they hardly extend to the case where the complete patient history includes much more repeated markers possibly. Our objective was thus to propose a solution for the dynamic prediction of a health event that may exploit repeated measures of a possibly large number of markers. We combined a landmark approach extended to endogenous markers history with machine learning methods adapted to survival data. Each marker trajectory is modeled using the information collec…
    Continual learning under domain transfer with sparse synaptic bursting. (arXiv:2108.12056v5 [cs.LG] UPDATED)
    (0 min) Existing machines are functionally specific tools that were made for easy prediction and control. Tomorrow's machines may be closer to biological systems in their mutability, resilience, and autonomy. But first they must be capable of learning, and retaining, new information without repeated exposure to it. Past efforts to engineer such systems have sought to build or regulate artificial neural networks using task-specific modules with constrained circumstances of application. This has not yet enabled continual learning over long sequences of previously unseen data without corrupting existing knowledge: a problem known as catastrophic forgetting. In this paper, we introduce a system that can learn sequentially over previously unseen datasets (ImageNet, CIFAR-100) with little forgetting over time. This is accomplished by regulating the activity of weights in a convolutional neural network on the basis of inputs using top-down modulation generated by a second feed-forward neural network. We find that our method learns continually under domain transfer with sparse bursts of activity in weights that are recycled across tasks, rather than by maintaining task-specific modules. Sparse synaptic bursting is found to balance enhanced and diminished activity in a way that facilitates adaptation to new inputs without corrupting previously acquired functions. This behavior emerges during a prior meta-learning phase in which regulated synapses are selectively disinhibited, or grown, from an initial state of uniform suppression.
    Pairwise Learning for Neural Link Prediction. (arXiv:2112.02936v6 [cs.LG] UPDATED)
    (0 min) In this paper, we aim at providing an effective Pairwise Learning Neural Link Prediction (PLNLP) framework. The framework treats link prediction as a pairwise learning to rank problem and consists of four main components, i.e., neighborhood encoder, link predictor, negative sampler and objective function. The framework is flexible that any generic graph neural convolution or link prediction specific neural architecture could be employed as neighborhood encoder. For link predictor, we design different scoring functions, which could be selected based on different types of graphs. In negative sampler, we provide several sampling strategies, which are problem specific. As for objective function, we propose to use an effective ranking loss, which approximately maximizes the standard ranking metric AUC. We evaluate the proposed PLNLP framework on 4 link property prediction datasets of Open Graph Benchmark, including ogbl-ddi, ogbl-collab, ogbl-ppa and ogbl-ciation2. PLNLP achieves top 1 performance on ogbl-ddi and ogbl-collab, and top 2 performance on ogbl-ciation2 only with basic neural architecture. The performance demonstrates the effectiveness of PLNLP.
    A Non-Expert's Introduction to Data Ethics for Mathematicians. (arXiv:2201.07794v1 [math.HO] CROSS LISTED)
    (0 min) I give a short introduction to data ethics. My focal audience is mathematicians, but I hope that my discussion will also be useful to others. I am not an expert about data ethics, and my article is only a starting point. I encourage readers to examine the resources that I discuss and to continue to reflect carefully on data ethics and on the societal implications of data and data analysis throughout their lives.
    End-to-End Jet Classification of Boosted Top Quarks with the CMS Open Data. (arXiv:2104.14659v3 [physics.data-an] UPDATED)
    (0 min) We describe a novel application of the end-to-end deep learning technique to the task of discriminating top quark-initiated jets from those originating from the hadronization of a light quark or a gluon. The end-to-end deep learning technique combines deep learning algorithms and low-level detector representation of the high-energy collision event. In this study, we use low-level detector information from the simulated CMS Open Data samples to construct the top jet classifiers. To optimize classifier performance we progressively add low-level information from the CMS tracking detector, including pixel detector reconstructed hits and impact parameters, and demonstrate the value of additional tracking information even when no new spatial structures are added. Relying only on calorimeter energy deposits and reconstructed pixel detector hits, the end-to-end classifier achieves an AUC score of 0.975$\pm$0.002 for the task of classifying boosted top quark jets. After adding derived track quantities, the classifier AUC score increases to 0.9824$\pm$0.0013, serving as the first performance benchmark for these CMS Open Data samples. We additionally provide a timing performance comparison of different processor unit architectures for training the network.
    User Multi-Interest Modeling for Behavioral Cognition. (arXiv:2110.11337v3 [cs.LG] UPDATED)
    (0 min) Representation modeling based on user behavior sequences is an important direction in user cognition. In this study, we propose a novel framework called Multi-Interest User Representation Model. Specifically, the model consists of two sub-models. The first sub-module is used to encode user behaviors in any period into a super-high dimensional sparse vector. Then, we design a self-supervised network to map vectors in the first module to low-dimensional dense user representations by contrastive learning. With the help of a novel attention module which can learn multi-interests of user, the second sub-module achieves almost lossless dimensionality reduction. Experiments on several benchmark datasets show that our approach works well and outperforms state-of-the-art unsupervised representation methods in different downstream tasks.
    Graph Neural Networks: Taxonomy, Advances and Trends. (arXiv:2012.08752v3 [cs.LG] UPDATED)
    (0 min) Graph neural networks provide a powerful toolkit for embedding real-world graphs into low-dimensional spaces according to specific tasks. Up to now, there have been several surveys on this topic. However, they usually lay emphasis on different angles so that the readers can not see a panorama of the graph neural networks. This survey aims to overcome this limitation, and provide a comprehensive review on the graph neural networks. First of all, we provide a novel taxonomy for the graph neural networks, and then refer to up to 400 relevant literatures to show the panorama of the graph neural networks. All of them are classified into the corresponding categories. In order to drive the graph neural networks into a new stage, we summarize four future research directions so as to overcome the facing challenges. It is expected that more and more scholars can understand and exploit the graph neural networks, and use them in their research community.
    On the convergence of group-sparse autoencoders. (arXiv:2102.07003v2 [cs.LG] UPDATED)
    (0 min) Recent approaches in the theoretical analysis of model-based deep learning architectures have studied the convergence of gradient descent in shallow ReLU networks that arise from generative models whose hidden layers are sparse. Motivated by the success of architectures that impose structured forms of sparsity, we introduce and study a group-sparse autoencoder that accounts for a variety of generative models, and utilizes a group-sparse ReLU activation function to force the non-zero units at a given layer to occur in blocks. For clustering models, inputs that result in the same group of active units belong to the same cluster. We proceed to analyze the gradient dynamics of a shallow instance of the proposed autoencoder, trained with data adhering to a group-sparse generative model. In this setting, we theoretically prove the convergence of the network parameters to a neighborhood of the generating matrix. We validate our model through numerical analysis and highlight the superior performance of networks with a group-sparse ReLU compared to networks that utilize traditional ReLUs, both in sparse coding and in parameter recovery tasks. We also provide real data experiments to corroborate the simulated results, and emphasize the clustering capabilities of structured sparsity models.
    DISSECT: Disentangled Simultaneous Explanations via Concept Traversals. (arXiv:2105.15164v2 [cs.LG] UPDATED)
    (0 min) Explaining deep learning model inferences is a promising venue for scientific understanding, improving safety, uncovering hidden biases, evaluating fairness, and beyond, as argued by many scholars. One of the principal benefits of counterfactual explanations is allowing users to explore "what-if" scenarios through what does not and cannot exist in the data, a quality that many other forms of explanation such as heatmaps and influence functions are inherently incapable of doing. However, most previous work on generative explainability cannot disentangle important concepts effectively, produces unrealistic examples, or fails to retain relevant information. We propose a novel approach, DISSECT, that jointly trains a generator, a discriminator, and a concept disentangler to overcome such challenges using little supervision. DISSECT generates Concept Traversals (CTs), defined as a sequence of generated examples with increasing degrees of concepts that influence a classifier's decision. By training a generative model from a classifier's signal, DISSECT offers a way to discover a classifier's inherent "notion" of distinct concepts automatically rather than rely on user-predefined concepts. We show that DISSECT produces CTs that (1) disentangle several concepts, (2) are influential to a classifier's decision and are coupled to its reasoning due to joint training (3), are realistic, (4) preserve relevant information, and (5) are stable across similar inputs. We validate DISSECT on several challenging synthetic and realistic datasets where previous methods fall short of satisfying desirable criteria for interpretability and show that it performs consistently well and better than existing methods. Finally, we present experiments showing applications of DISSECT for detecting potential biases of a classifier and identifying spurious artifacts that impact predictions.
    ProtoRes: Proto-Residual Architecture for Deep Modeling of Human Pose. (arXiv:2106.01981v3 [cs.CV] UPDATED)
    (0 min) Our work focuses on the development of a learnable neural representation of human pose for advanced AI assisted animation tooling. Specifically, we tackle the problem of constructing a full static human pose based on sparse and variable user inputs (e.g. locations and/or orientations of a subset of body joints). To solve this problem, we propose a novel neural architecture that combines residual connections with prototype encoding of a partially specified pose to create a new complete pose from the learned latent space. We show that our architecture outperforms a baseline based on Transformer, both in terms of accuracy and computational efficiency. Additionally, we develop a user interface to integrate our neural model in Unity, a real-time 3D development platform. Furthermore, we introduce two new datasets representing the static human pose modeling problem, based on high-quality human motion capture data, which will be released publicly along with model code.
    Czech News Dataset for Semantic Textual Similarity. (arXiv:2108.08708v3 [cs.CL] UPDATED)
    (0 min) This paper describes a novel dataset consisting of sentences with semantic similarity annotations. The data originate from the journalistic domain in the Czech language. We describe the process of collecting and annotating the data in detail. The dataset contains 138,556 human annotations divided into train and test sets. In total, 485 journalism students participated in the creation process. To increase the reliability of the test set, we compute the annotation as an average of 9 individual annotations. We evaluate the quality of the dataset by measuring inter and intra annotation annotators' agreements. Beside agreement numbers, we provide detailed statistics of the collected dataset. We conclude our paper with a baseline experiment of building a system for predicting the semantic similarity of sentences. Due to the massive number of training annotations (116 956), the model can perform significantly better than an average annotator (0,92 versus 0,86 of Person's correlation coefficients).
    Satisficing Paths and Independent Multi-Agent Reinforcement Learning in Stochastic Games. (arXiv:2110.04638v2 [cs.GT] UPDATED)
    (0 min) In multi-agent reinforcement learning (MARL), independent learners are those that do not access the action selections of other agents in the system. Due to the decentralization of information, it is generally difficult to design independent learners that drive the system to equilibrium. This paper investigates the feasibility of using \emph{satisficing} dynamics to guide independent learners to approximate equilibrium policies in non-episodic, discounted stochastic games. Satisficing refers to halting search in an optimization problem upon finding a satisfactory but possibly suboptimal input; here, we take it to mean halting search upon finding an input that achieves a cost within $\epsilon$ of the minimum cost. In this paper, we define a useful structural concept for games, to be termed the $\epsilon$-satisficing paths property, and we prove that this property holds for any $\epsilon \geq 0$ in general two-player stochastic games and in $N$-player stochastic games satisfying a symmetry condition. To illustrate the utility of this property, we present an independent learning algorithm that comes with high probability guarantees of approximate equilibrium in $N$-player symmetric games. This guarantee is made assuming symmetry alone, without additional assumptions such as a zero sum, team, or potential game structure. %
    Multi-Agent Reinforcement Learning for Active Voltage Control on Power Distribution Networks. (arXiv:2110.14300v5 [cs.LG] UPDATED)
    (0 min) This paper presents a problem in power networks that creates an exciting and yet challenging real-world scenario for application of multi-agent reinforcement learning (MARL). The emerging trend of decarbonisation is placing excessive stress on power distribution networks. Active voltage control is seen as a promising solution to relieve power congestion and improve voltage quality without extra hardware investment, taking advantage of the controllable apparatuses in the network, such as roof-top photovoltaics (PVs) and static var compensators (SVCs). These controllable apparatuses appear in a vast number and are distributed in a wide geographic area, making MARL a natural candidate. This paper formulates the active voltage control problem in the framework of Dec-POMDP and establishes an open-source environment. It aims to bridge the gap between the power community and the MARL community and be a drive force towards real-world applications of MARL algorithms. Finally, we analyse the special characteristics of the active voltage control problems that cause challenges (e.g. interpretability) for state-of-the-art MARL approaches, and summarise the potential directions.
    Deep Gravity: enhancing mobility flows generation with deep neural networks and geographic information. (arXiv:2012.00489v5 [cs.LG] UPDATED)
    (0 min) The movements of individuals within and among cities influence critical aspects of our society, such as well-being, the spreading of epidemics, and the quality of the environment. When information about mobility flows is not available for a particular region of interest, we must rely on mathematical models to generate them. In this work, we propose the Deep Gravity model, an effective method to generate flow probabilities that exploits many variables (e.g., land use, road network, transport, food, health facilities) extracted from voluntary geographic data, and uses deep neural networks to discover non-linear relationships between those variables and mobility flows. Our experiments, conducted on mobility flows in England, Italy, and New York State, show that Deep Gravity has good geographic generalization capability, achieving a significant increase in performance (especially in densely populated regions of interest) with respect to the classic gravity model and models that do not use deep neural networks or geographic data. We also show how flows generated by Deep Gravity may be explained in terms of the geographic features using explainable AI techniques.
    Symbolic Behaviour in Artificial Intelligence. (arXiv:2102.03406v2 [cs.AI] UPDATED)
    (2 min) The ability to use symbols is the pinnacle of human intelligence, but has yet to be fully replicated in machines. Here we argue that the path towards symbolically fluent artificial intelligence (AI) begins with a reinterpretation of what symbols are, how they come to exist, and how a system behaves when it uses them. We begin by offering an interpretation of symbols as entities whose meaning is established by convention. But crucially, something is a symbol only for those who demonstrably and actively participate in this convention. We then outline how this interpretation thematically unifies the behavioural traits humans exhibit when they use symbols. This motivates our proposal that the field place a greater emphasis on symbolic behaviour rather than particular computational mechanisms inspired by more restrictive interpretations of symbols. Finally, we suggest that AI research explore social and cultural engagement as a tool to develop the cognitive machinery necessary for symbolic behaviour to emerge. This approach will allow for AI to interpret something as symbolic on its own rather than simply manipulate things that are only symbols to human onlookers, and thus will ultimately lead to AI with more human-like symbolic fluency.
    Marginal Effects for Non-Linear Prediction Functions. (arXiv:2201.08837v1 [cs.LG])
    (2 min) Beta coefficients for linear regression models represent the ideal form of an interpretable feature effect. However, for non-linear models and especially generalized linear models, the estimated coefficients cannot be interpreted as a direct feature effect on the predicted outcome. Hence, marginal effects are typically used as approximations for feature effects, either in the shape of derivatives of the prediction function or forward differences in prediction due to a change in a feature value. While marginal effects are commonly used in many scientific fields, they have not yet been adopted as a model-agnostic interpretation method for machine learning models. This may stem from their inflexibility as a univariate feature effect and their inability to deal with the non-linearities found in black box models. We introduce a new class of marginal effects termed forward marginal effects. We argue to abandon derivatives in favor of better-interpretable forward differences. Furthermore, we generalize marginal effects based on forward differences to multivariate changes in feature values. To account for the non-linearity of prediction functions, we introduce a non-linearity measure for marginal effects. We argue against summarizing feature effects of a non-linear prediction function in a single metric such as the average marginal effect. Instead, we propose to partition the feature space to compute conditional average marginal effects on feature subspaces, which serve as conditional feature effect estimates.
    Occupancy Information Ratio: Infinite-Horizon, Information-Directed, Parameterized Policy Search. (arXiv:2201.08832v1 [cs.LG])
    (0 min) We develop a new measure of the exploration/exploitation trade-off in infinite-horizon reinforcement learning problems called the occupancy information ratio (OIR), which is comprised of a ratio between the infinite-horizon average cost of a policy and the entropy of its long-term state occupancy measure. The OIR ensures that no matter how many trajectories an RL agent traverses or how well it learns to minimize cost, it maintains a healthy skepticism about its environment, in that it defines an optimal policy which induces a high-entropy occupancy measure. Different from earlier information ratio notions, OIR is amenable to direct policy search over parameterized families, and exhibits hidden quasiconcavity through invocation of the perspective transformation. This feature ensures that under appropriate policy parameterizations, the OIR optimization problem has no spurious stationary points, despite the overall problem's nonconvexity. We develop for the first time policy gradient and actor-critic algorithms for OIR optimization based upon a new entropy gradient theorem, and establish both asymptotic and non-asymptotic convergence results with global optimality guarantees. In experiments, these methodologies outperform several deep RL baselines in problems with sparse rewards, where many trajectories may be uninformative and skepticism about the environment is crucial to success.
    CapillaryNet: An Automated System to Quantify Skin Capillary Density and Red Blood Cell Velocity from Handheld Vital Microscopy. (arXiv:2104.11574v4 [eess.IV] UPDATED)
    (0 min) Capillaries are the smallest vessels in the body responsible for delivering oxygen and nutrients to surrounding cells. Various life-threatening diseases are known to alter the density of healthy capillaries and the flow velocity of erythrocytes within the capillaries. In previous studies, capillary density and flow velocity were manually assessed by trained specialists. However, manual analysis of a standard 20-second microvascular video requires 20 minutes on average and necessitates extensive training. Thus, manual analysis has been reported to hinder the application of microvascular microscopy in a clinical environment. To address this problem, this paper presents a fully automated state-of-the-art system to quantify skin nutritive capillary density and red blood cell velocity captured by handheld-based microscopy videos. The proposed method combines the speed of traditional computer vision algorithms with the accuracy of convolutional neural networks to enable clinical capillary analysis. The results show that the proposed system fully automates capillary detection with an accuracy exceeding that of trained analysts and measures several novel microvascular parameters that had eluded quantification thus far, namely, capillary hematocrit and intracapillary flow velocity heterogeneity. The proposed end-to-end system, named CapillaryNet, can detect capillaries at $\sim$0.9 seconds per frame with $\sim$93\% accuracy. The system is currently being used as a clinical research product in a larger e-health application to analyse capillary data captured from patients suffering from COVID-19, pancreatitis, and acute heart diseases. CapillaryNet narrows the gap between the analysis of microcirculation images in a clinical environment and state-of-the-art systems.
    Automated Deep Learning: Neural Architecture Search Is Not the End. (arXiv:2112.09245v2 [cs.LG] UPDATED)
    (0 min) Deep learning (DL) has proven to be a highly effective approach for developing models in diverse contexts, including visual perception, speech recognition, and machine translation. However, the end-to-end process for applying DL is not trivial. It requires grappling with problem formulation and context understanding, data engineering, model development, deployment, continuous monitoring and maintenance, and so on. Moreover, each of these steps typically relies heavily on humans, in terms of both knowledge and interactions, which impedes the further advancement and democratization of DL. Consequently, in response to these issues, a new field has emerged over the last few years: automated deep learning (AutoDL). This endeavor seeks to minimize the need for human involvement and is best known for its achievements in neural architecture search (NAS), a topic that has been the focus of several surveys. That stated, NAS is not the be-all and end-all of AutoDL. Accordingly, this review adopts an overarching perspective, examining research efforts into automation across the entirety of an archetypal DL workflow. In so doing, this work also proposes a comprehensive set of ten criteria by which to assess existing work in both individual publications and broader research areas. These criteria are: novelty, solution quality, efficiency, stability, interpretability, reproducibility, engineering quality, scalability, generalizability, and eco-friendliness. Thus, ultimately, this review provides an evaluative overview of AutoDL in the early 2020s, identifying where future opportunities for progress may exist.
    Less is Less: When Are Snippets Insufficient for Human vs Machine Relevance Estimation?. (arXiv:2201.08721v1 [cs.IR])
    (0 min) Traditional information retrieval (IR) ranking models process the full text of documents. Newer models based on Transformers, however, would incur a high computational cost when processing long texts, so typically use only snippets from the document instead. The model's input based on a document's URL, title, and snippet (UTS) is akin to the summaries that appear on a search engine results page (SERP) to help searchers decide which result to click. This raises questions about when such summaries are sufficient for relevance estimation by the ranking model or the human assessor, and whether humans and machines benefit from the document's full text in similar ways. To answer these questions, we study human and neural model based relevance assessments on 12k query-documents sampled from Bing's search logs. We compare changes in the relevance assessments when only the document summaries and when the full text is also exposed to assessors, studying a range of query and document properties, e.g., query type, snippet length. Our findings show that the full text is beneficial for humans and a BERT model for similar query and document types, e.g., tail, long queries. A closer look, however, reveals that humans and machines respond to the additional input in very different ways. Adding the full text can also hurt the ranker's performance, e.g., for navigational queries.
    CARMI: A Cache-Aware Learned Index with a Cost-based Construction Algorithm. (arXiv:2103.00858v3 [cs.DB] UPDATED)
    (0 min) Learned indexes, which use machine learning models to replace traditional index structures, have shown promising results in recent studies. Existing learned indexes use heuristic rules to construct index structures, which are often suboptimal and sensitive to data distribution. In this paper, we argue that upper-level RMI nodes should focus on data partitioning instead of model fitting, and show that it leads to much better results in real-world datasets. We introduce entropy as a metric to quantify and characterize the models in learned indexes, which provides a new theoretical basis for subsequent works. Moreover, we propose a new memory layout design with a fixed node size throughout the tree structure, which allows the type of each node to be flexibly chosen at runtime. We propose CARMI, a new efficient and updatable cache-aware RMI framework. To reduce reliance on the expertise of database administrators, CARMI uses a hybrid construction algorithm to automatically construct the index structures under various datasets and workloads without any manual tuning. Our experimental study shows that CARMI performs better and is more robust compared to baselines, achieving an average of 2.37x/1.98x speedup compared to B+ Tree/ALEX, while using only about 0.70x memory space of B+ Tree. In the SOSD platform, CARMI outperforms all the baselines over the real-world datasets, with an average speedup of 1.21x over the nearest competitor. We believe that our theoretical analysis and proposed framework can help learned indexes to get closer to practical deployment.
    Sequential Item Recommendation in the MOBA Game Dota 2. (arXiv:2201.08724v1 [cs.LG])
    (0 min) Multiplayer Online Battle Arena (MOBA) games such as Dota 2 attract hundreds of thousands of players every year. Despite the large player base, it is still important to attract new players to prevent the community of a game from becoming inactive. Entering MOBA games is, however, often demanding, requiring the player to learn numerous skills at once. An important factor of success is buying the correct items which forms a complex task depending on various in-game factors such as already purchased items, the team composition, or available resources. A recommendation system can support players by reducing the mental effort required to choose a suitable item, helping, e.g., newer players or players returning to the game after a longer break, to focus on other aspects of the game. Since Sequential Item Recommendation (SIR) has proven to be effective in various domains (e.g. e-commerce, movie recommendation or playlist continuation), we explore the applicability of well-known SIR models in the context of purchase recommendations in Dota 2. To facilitate this research, we collect, analyze and publish Dota-350k, a new large dataset based on recent Dota 2 matches. We find that SIR models can be employed effectively for item recommendation in Dota 2. Our results show that models that consider the order of purchases are the most effective. In contrast to other domains, we find RNN-based models to outperform the more recent Transformer-based architectures on Dota-350k.
    Analysis and algorithms for $\ell_p$-based semi-supervised learning on graphs. (arXiv:1901.05031v3 [math.NA] UPDATED)
    (0 min) This paper addresses theory and applications of $\ell_p$-based Laplacian regularization in semi-supervised learning. The graph $p$-Laplacian for $p>2$ has been proposed recently as a replacement for the standard ($p=2$) graph Laplacian in semi-supervised learning problems with very few labels, where Laplacian learning is degenerate. In the first part of the paper we prove new discrete to continuum convergence results for $p$-Laplace problems on $k$-nearest neighbor ($k$-NN) graphs, which are more commonly used in practice than random geometric graphs. Our analysis shows that, on $k$-NN graphs, the $p$-Laplacian retains information about the data distribution as $p\to \infty$ and Lipschitz learning ($p=\infty$) is sensitive to the data distribution. This situation can be contrasted with random geometric graphs, where the $p$-Laplacian forgets the data distribution as $p\to \infty$. We also present a general framework for proving discrete to continuum convergence results in graph-based learning that only requires pointwise consistency and monotonicity. In the second part of the paper, we develop fast algorithms for solving the variational and game-theoretic $p$-Laplace equations on weighted graphs for $p>2$. We present several efficient and scalable algorithms for both formulations, and present numerical results on synthetic data indicating their convergence properties. Finally, we conduct extensive numerical experiments on the MNIST, FashionMNIST and EMNIST datasets that illustrate the effectiveness of the $p$-Laplacian formulation for semi-supervised learning with few labels. In particular, we find that Lipschitz learning ($p=\infty$) performs well with very few labels on $k$-NN graphs, which experimentally validates our theoretical findings that Lipschitz learning retains information about the data distribution (the unlabeled data) on $k$-NN graphs.
    AutoDistill: an End-to-End Framework to Explore and Distill Hardware-Efficient Language Models. (arXiv:2201.08539v1 [cs.LG])
    (0 min) Recently, large pre-trained models have significantly improved the performance of various Natural LanguageProcessing (NLP) tasks but they are expensive to serve due to long serving latency and large memory usage. To compress these models, knowledge distillation has attracted an increasing amount of interest as one of the most effective methods for model compression. However, existing distillation methods have not yet addressed the unique challenges of model serving in datacenters, such as handling fast evolving models, considering serving performance, and optimizing for multiple objectives. To solve these problems, we propose AutoDistill, an end-to-end model distillation framework integrating model architecture exploration and multi-objective optimization for building hardware-efficient NLP pre-trained models. We use Bayesian Optimization to conduct multi-objective Neural Architecture Search for selecting student model architectures. The proposed search comprehensively considers both prediction accuracy and serving latency on target hardware. The experiments on TPUv4i show the finding of seven model architectures with better pre-trained accuracy (up to 3.2% higher) and lower inference latency (up to 1.44x faster) than MobileBERT. By running downstream NLP tasks in the GLUE benchmark, the model distilled for pre-training by AutoDistill with 28.5M parameters achieves an 81.69 average score, which is higher than BERT_BASE, DistillBERT, TinyBERT, NAS-BERT, and MobileBERT. The most compact model found by AutoDistill contains only 20.6M parameters but still outperform BERT_BASE(109M), DistillBERT(67M), TinyBERT(67M), and MobileBERT(25.3M) regarding the average GLUE score. By evaluating on SQuAD, a model found by AutoDistill achieves an 88.4% F1 score with 22.8M parameters, which reduces parameters by more than 62% while maintaining higher accuracy than DistillBERT, TinyBERT, and NAS-BERT.
    Evaluating Generalization in Classical and Quantum Generative Models. (arXiv:2201.08770v1 [cs.LG])
    (0 min) Defining and accurately measuring generalization in generative models remains an ongoing challenge and a topic of active research within the machine learning community. This is in contrast to discriminative models, where there is a clear definition of generalization, i.e., the model's classification accuracy when faced with unseen data. In this work, we construct a simple and unambiguous approach to evaluate the generalization capabilities of generative models. Using the sample-based generalization metrics proposed here, any generative model, from state-of-the-art classical generative models such as GANs to quantum models such as Quantum Circuit Born Machines, can be evaluated on the same ground on a concrete well-defined framework. In contrast to other sample-based metrics for probing generalization, we leverage constrained optimization problems (e.g., cardinality constrained problems) and use these discrete datasets to define specific metrics capable of unambiguously measuring the quality of the samples and the model's generalization capabilities for generating data beyond the training set but still within the valid solution space. Additionally, our metrics can diagnose trainability issues such as mode collapse and overfitting, as we illustrate when comparing GANs to quantum-inspired models built out of tensor networks. Our simulation results show that our quantum-inspired models have up to a $68 \times$ enhancement in generating unseen unique and valid samples compared to GANs, and a ratio of 61:2 for generating samples with better quality than those observed in the training set. We foresee these metrics as valuable tools for rigorously defining practical quantum advantage in the domain of generative modeling.
    Real-Time Seizure Detection using EEG: A Comprehensive Comparison of Recent Approaches under a Realistic Setting. (arXiv:2201.08780v1 [cs.LG])
    (0 min) Electroencephalogram (EEG) is an important diagnostic test that physicians use to record brain activity and detect seizures by monitoring the signals. There have been several attempts to detect seizures and abnormalities in EEG signals with modern deep learning models to reduce the clinical burden. However, they cannot be fairly compared against each other as they were tested in distinct experimental settings. Also, some of them are not trained in real-time seizure detection tasks, making it hard for on-device applications. Therefore in this work, for the first time, we extensively compare multiple state-of-the-art models and signal feature extractors in a real-time seizure detection framework suitable for real-world application, using various evaluation metrics including a new one we propose to evaluate more practical aspects of seizure detection models. Our code is available at https://github.com/AITRICS/EEG_real_time_seizure_detection.
    Dual Contrastive Learning: Text Classification via Label-Aware Data Augmentation. (arXiv:2201.08702v1 [cs.CL])
    (0 min) Contrastive learning has achieved remarkable success in representation learning via self-supervision in unsupervised settings. However, effectively adapting contrastive learning to supervised learning tasks remains as a challenge in practice. In this work, we introduce a dual contrastive learning (DualCL) framework that simultaneously learns the features of input samples and the parameters of classifiers in the same space. Specifically, DualCL regards the parameters of the classifiers as augmented samples associating to different labels and then exploits the contrastive learning between the input samples and the augmented samples. Empirical studies on five benchmark text classification datasets and their low-resource version demonstrate the improvement in classification accuracy and confirm the capability of learning discriminative representations of DualCL.
    Low-Interception Waveform: To Prevent the Recognition of Spectrum Waveform Modulation via Adversarial Examples. (arXiv:2201.08731v1 [cs.LG])
    (0 min) Deep learning is applied to many complex tasks in the field of wireless communication, such as modulation recognition of spectrum waveforms, because of its convenience and efficiency. This leads to the problem of a malicious third party using a deep learning model to easily recognize the modulation format of the transmitted waveform. Some existing works address this problem directly using the concept of adversarial examples in the image domain without fully considering the characteristics of the waveform transmission in the physical world. Therefore, we propose a low-intercept waveform~(LIW) generation method that can reduce the probability of the modulation being recognized by a third party without affecting the reliable communication of the friendly party. Our LIW exhibits significant low-interception performance even in the physical hardware experiment, decreasing the accuracy of the state of the art model to approximately $15\%$ with small perturbations.
    Cost-sensitive Boosting Pruning Trees for depression detection on Twitter. (arXiv:1906.00398v3 [cs.LG] UPDATED)
    (0 min) Depression is one of the most common mental health disorders, and a large number of depressed people commit suicide each year. Potential depression sufferers usually do not consult psychological doctors because they feel ashamed or are unaware of any depression, which may result in severe delay of diagnosis and treatment. In the meantime, evidence shows that social media data provides valuable clues about physical and mental health conditions. In this paper, we argue that it is feasible to identify depression at an early stage by mining online social behaviours. Our approach, which is innovative to the practice of depression detection, does not rely on the extraction of numerous or complicated features to achieve accurate depression detection. Instead, we propose a novel classifier, namely, Cost-sensitive Boosting Pruning Trees (CBPT), which demonstrates a strong classification ability on two publicly accessible Twitter depression detection datasets. To comprehensively evaluate the classification capability of the CBPT, we use additional three datasets from the UCI machine learning repository and the CBPT obtains appealing classification results against several state of the arts boosting algorithms. Finally, we comprehensively explore the influence factors of model prediction, and the results manifest that our proposed framework is promising for identifying Twitter users with depression.
    Spatiotemporal Analysis Using Riemannian Composition of Diffusion Operators. (arXiv:2201.08530v1 [stat.ML])
    (0 min) Multivariate time-series have become abundant in recent years, as many data-acquisition systems record information through multiple sensors simultaneously. In this paper, we assume the variables pertain to some geometry and present an operator-based approach for spatiotemporal analysis. Our approach combines three components that are often considered separately: (i) manifold learning for building operators representing the geometry of the variables, (ii) Riemannian geometry of symmetric positive-definite matrices for multiscale composition of operators corresponding to different time samples, and (iii) spectral analysis of the composite operators for extracting different dynamic modes. We propose a method that is analogous to the classical wavelet analysis, which we term Riemannian multi-resolution analysis (RMRA). We provide some theoretical results on the spectral analysis of the composite operators, and we demonstrate the proposed method on simulations and on real data.
    FedComm: Federated Learning as a Medium for Covert Communication. (arXiv:2201.08786v1 [cs.CR])
    (0 min) Proposed as a solution to mitigate the privacy implications related to the adoption of deep learning solutions, Federated Learning (FL) enables large numbers of participants to successfully train deep neural networks without having to reveal the actual private training data. To date, a substantial amount of research has investigated the security and privacy properties of FL, resulting in a plethora of innovative attack and defense strategies. This paper thoroughly investigates the communication capabilities of an FL scheme. In particular, we show that a party involved in the FL learning process can use FL as a covert communication medium to send an arbitrary message. We introduce FedComm, a novel covert-communication technique that enables robust sharing and transfer of targeted payloads within the FL framework. Our extensive theoretical and empirical evaluations show that FedComm provides a stealthy communication channel, with minimal disruptions to the training process. Our experiments show that FedComm, allowed us to successfully deliver 100% of a payload in the order of kilobits before the FL procedure converges. Our evaluation also shows that FedComm is independent of the application domain and the neural network architecture used by the underlying FL scheme.
    Physical Activity Recognition by Utilising Smartphone Sensor Signals. (arXiv:2201.08688v1 [cs.HC])
    (2 min) Human physical motion activity identification has many potential applications in various fields, such as medical diagnosis, military sensing, sports analysis, and human-computer security interaction. With the recent advances in smartphones and wearable technologies, it has become common for such devices to have embedded motion sensors that are able to sense even small body movements. This study collected human activity data from 60 participants across two different days for a total of six activities recorded by gyroscope and accelerometer sensors in a modern smartphone. The paper investigates to what extent different activities can be identified by utilising machine learning algorithms using approaches such as majority algorithmic voting. More analyses are also provided that reveal which time and frequency domain based features were best able to identify individuals motion activity types. Overall, the proposed approach achieved a classification accuracy of 98 percent in identifying four different activities: walking, walking upstairs, walking downstairs, and sitting while the subject is calm and doing a typical desk-based activity.
    Learning deterministic hydrodynamic equations from stochastic active particle dynamics. (arXiv:2201.08623v1 [cond-mat.soft])
    (2 min) We present a principled data-driven strategy for learning deterministic hydrodynamic models directly from stochastic non-equilibrium active particle trajectories. We apply our method to learning a hydrodynamic model for the propagating density lanes observed in self-propelled particle systems and to learning a continuum description of cell dynamics in epithelial tissues. We also infer from stochastic particle trajectories the latent phoretic fields driving chemotaxis. This demonstrates that statistical learning theory combined with physical priors can enable discovery of multi-scale models of non-equilibrium stochastic processes characteristic of collective movement in living systems.
    Adaptive Data Analysis with Correlated Observations. (arXiv:2201.08704v1 [cs.LG])
    (2 min) The vast majority of the work on adaptive data analysis focuses on the case where the samples in the dataset are independent. Several approaches and tools have been successfully applied in this context, such as differential privacy, max-information, compression arguments, and more. The situation is far less well-understood without the independence assumption. We embark on a systematic study of the possibilities of adaptive data analysis with correlated observations. First, we show that, in some cases, differential privacy guarantees generalization even when there are dependencies within the sample, which we quantify using a notion we call Gibbs-dependence. We complement this result with a tight negative example. Second, we show that the connection between transcript-compression and adaptive data analysis can be extended to the non-iid setting.
    Instance-Dependent Confidence and Early Stopping for Reinforcement Learning. (arXiv:2201.08536v1 [stat.ML])
    (2 min) Various algorithms for reinforcement learning (RL) exhibit dramatic variation in their convergence rates as a function of problem structure. Such problem-dependent behavior is not captured by worst-case analyses and has accordingly inspired a growing effort in obtaining instance-dependent guarantees and deriving instance-optimal algorithms for RL problems. This research has been carried out, however, primarily within the confines of theory, providing guarantees that explain \textit{ex post} the performance differences observed. A natural next step is to convert these theoretical guarantees into guidelines that are useful in practice. We address the problem of obtaining sharp instance-dependent confidence regions for the policy evaluation problem and the optimal value estimation problem of an MDP, given access to an instance-optimal algorithm. As a consequence, we propose a data-dependent stopping rule for instance-optimal algorithms. The proposed stopping rule adapts to the instance-specific difficulty of the problem and allows for early termination for problems with favorable structure.
    High-Dimensional Inference over Networks: Linear Convergence and Statistical Guarantees. (arXiv:2201.08507v1 [cs.LG])
    (2 min) We study sparse linear regression over a network of agents, modeled as an undirected graph and no server node. The estimation of the $s$-sparse parameter is formulated as a constrained LASSO problem wherein each agent owns a subset of the $N$ total observations. We analyze the convergence rate and statistical guarantees of a distributed projected gradient tracking-based algorithm under high-dimensional scaling, allowing the ambient dimension $d$ to grow with (and possibly exceed) the sample size $N$. Our theory shows that, under standard notions of restricted strong convexity and smoothness of the loss functions, suitable conditions on the network connectivity and algorithm tuning, the distributed algorithm converges globally at a {\it linear} rate to an estimate that is within the centralized {\it statistical precision} of the model, $O(s\log d/N)$. When $s\log d/N=o(1)$, a condition necessary for statistical consistency, an $\varepsilon$-optimal solution is attained after $\mathcal{O}(\kappa \log (1/\varepsilon))$ gradient computations and $O (\kappa/(1-\rho) \log (1/\varepsilon))$ communication rounds, where $\kappa$ is the restricted condition number of the loss function and $\rho$ measures the network connectivity. The computation cost matches that of the centralized projected gradient algorithm despite having data distributed; whereas the communication rounds reduce as the network connectivity improves. Overall, our study reveals interesting connections between statistical efficiency, network connectivity \& topology, and convergence rate in high dimensions.
    Deep Q-learning: a robust control approach. (arXiv:2201.08610v1 [cs.LG])
    (2 min) In this paper, we place deep Q-learning into a control-oriented perspective and study its learning dynamics with well-established techniques from robust control. We formulate an uncertain linear time-invariant model by means of the neural tangent kernel to describe learning. We show the instability of learning and analyze the agent's behavior in frequency-domain. Then, we ensure convergence via robust controllers acting as dynamical rewards in the loss function. We synthesize three controllers: state-feedback gain scheduling $\mathcal{H}_2$, dynamic $\mathcal{H}_\infty$, and constant gain $\mathcal{H}_\infty$ controllers. Setting up the learning agent with a control-oriented tuning methodology is more transparent and has well-established literature compared to the heuristics in reinforcement learning. In addition, our approach does not use a target network and randomized replay memory. The role of the target network is overtaken by the control input, which also exploits the temporal dependency of samples (opposed to a randomized memory buffer). Numerical simulations in different OpenAI Gym environments suggest that the $\mathcal{H}_\infty$ controlled learning performs slightly better than Double deep Q-learning.
    Random Noise vs State-of-the-Art Probabilistic Forecasting Methods : A Case Study on CRPS-Sum Discrimination Ability. (arXiv:2201.08671v1 [cs.LG])
    (2 min) The recent developments in the machine learning domain have enabled the development of complex multivariate probabilistic forecasting models. Therefore, it is pivotal to have a precise evaluation method to gauge the performance and predictability power of these complex methods. To do so, several evaluation metrics have been proposed in the past (such as Energy Score, Dawid-Sebastiani score, variogram score), however, they cannot reliably measure the performance of a probabilistic forecaster. Recently, CRPS-sum has gained a lot of prominence as a reliable metric for multivariate probabilistic forecasting. This paper presents a systematic evaluation of CRPS-sum to understand its discrimination ability. We show that the statistical properties of target data affect the discrimination ability of CRPS-Sum. Furthermore, we highlight that CRPS-Sum calculation overlooks the performance of the model on each dimension. These flaws can lead us to an incorrect assessment of model performance. Finally, with experiments on the real-world dataset, we demonstrate that the shortcomings of CRPS-Sum provide a misleading indication of the probabilistic forecasting performance method. We show that it is easily possible to have a better CRPS-Sum for a dummy model, which looks like random noise, in comparison to the state-of-the-art method.
    Human Activity Recognition models using Limited Consumer Device Sensors and Machine Learning. (arXiv:2201.08565v1 [cs.LG])
    (2 min) Human activity recognition has grown in popularity with its increase of applications within daily lifestyles and medical environments. The goal of having efficient and reliable human activity recognition brings benefits such as accessible use and better allocation of resources; especially in the medical industry. Activity recognition and classification can be obtained using many sophisticated data recording setups, but there is also a need in observing how performance varies among models that are strictly limited to using sensor data from easily accessible devices: smartphones and smartwatches. This paper presents the findings of different models that are limited to train using such sensors. The models are trained using either the k-Nearest Neighbor, Support Vector Machine, or Random Forest classifier algorithms. Performance and evaluations are done by comparing various model performances using different combinations of mobile sensors and how they affect recognitive performances of models. Results show promise for models trained strictly using limited sensor data collected from only smartphones and smartwatches coupled with traditional machine learning concepts and algorithms.
    A deep learning energy method for hyperelasticity and viscoelasticity. (arXiv:2201.08690v1 [cs.LG])
    (2 min) The potential energy formulation and deep learning are merged to solve partial differential equations governing the deformation in hyperelastic and viscoelastic materials. The presented deep energy method (DEM) is self-contained and meshfree. It can accurately capture the three-dimensional (3D) mechanical response without requiring any time-consuming training data generation by classical numerical methods such as the finite element method. Once the model is appropriately trained, the response can be attained almost instantly at any point in the physical domain, given its spatial coordinates. Therefore, the deep energy method is potentially a promising standalone method for solving partial differential equations describing the mechanical deformation of materials or structural systems and other physical phenomena.
    To SMOTE, or not to SMOTE?. (arXiv:2201.08528v1 [cs.LG])
    (2 min) In imbalanced binary classification problems the objective metric is often non-symmetric and associates a higher penalty with the minority samples. On the other hand, the loss function used for training is usually symmetric - equally penalizing majority and minority samples. Balancing schemes, that augment the data to be more balanced before training the model, were proposed to address this discrepancy and were shown to improve prediction performance empirically on tabular data. However, recent studies of consistent classifiers suggest that the metric discrepancy might not hinder prediction performance. In light of these recent theoretical results, we carefully revisit the empirical study of balancing tabular data. Our extensive experiments, on 73 datasets, show that generally, in accordance with theory, best prediction is achieved by using a strong consistent classifier and balancing is not beneficial. We further identity several scenarios for which balancing is effective and observe that prior studies mainly focus on these settings.
    Privacy Policies Across the Ages: Content and Readability of Privacy Policies 1996--2021. (arXiv:2201.08739v1 [cs.CR])
    (2 min) It is well-known that most users do not read privacy policies, but almost all users tick the box to agree with them. In this paper, we analyze the 25-year history of privacy policies using methods from transparency research, machine learning, and natural language processing. Specifically, we collect a large-scale longitudinal corpus of privacy policies from 1996 to 2021 and analyze the length and readability of privacy policies as well as their content in terms of the data practices they describe, the rights they grant to users, and the rights they reserve for their organizations. We pay particular attention to changes in response to recent privacy regulations such as the GDPR and CCPA. Our results show that policies are getting longer and harder to read, especially after new regulations take effect, and we find a range of concerning data practices. Our results allow us to speculate why privacy policies are rarely read and propose changes that would make privacy policies serve their readers instead of their writers.
    Optimal variance-reduced stochastic approximation in Banach spaces. (arXiv:2201.08518v1 [math.ST])
    (2 min) We study the problem of estimating the fixed point of a contractive operator defined on a separable Banach space. Focusing on a stochastic query model that provides noisy evaluations of the operator, we analyze a variance-reduced stochastic approximation scheme, and establish non-asymptotic bounds for both the operator defect and the estimation error, measured in an arbitrary semi-norm. In contrast to worst-case guarantees, our bounds are instance-dependent, and achieve the local asymptotic minimax risk non-asymptotically. For linear operators, contractivity can be relaxed to multi-step contractivity, so that the theory can be applied to problems like average reward policy evaluation problem in reinforcement learning. We illustrate the theory via applications to stochastic shortest path problems, two-player zero-sum Markov games, as well as policy evaluation and $Q$-learning for tabular Markov decision processes.
    Clipped DeepControl: deep neural network two-dimensional pulse design with an amplitude constraint layer. (arXiv:2201.08668v1 [physics.med-ph])
    (2 min) Advanced radio-frequency pulse design used in magnetic resonance imaging has recently been demonstrated with deep learning of (convolutional) neural networks and reinforcement learning. For two-dimensionally selective radio-frequency pulses, the (convolutional) neural network pulse prediction time (few milliseconds) was in comparison more than three orders of magnitude faster than the conventional optimal control computation. The network pulses were from the supervised training capable of compensating scan-subject dependent inhomogeneities of B0 and B+1 fields. Unfortunately, the network presented with a non-negligible percentage of pulse amplitude overshoots in the test subset, despite the optimal control pulses used in training were fully constrained. Here, we have extended the convolutional neural network with a custom-made clipping layer that completely eliminates the risk of pulse amplitude overshoots, while preserving the ability to compensate the inhomogeneous field conditions.
    Hold On and Swipe: A Touch-Movement Based Continuous Authentication Schema based on Machine Learning. (arXiv:2201.08564v1 [cs.CR])
    (2 min) In recent years the amount of secure information being stored on mobile devices has grown exponentially. However, current security schemas for mobile devices such as physiological biometrics and passwords are not secure enough to protect this information. Behavioral biometrics have been heavily researched as a possible solution to this security deficiency for mobile devices. This study aims to contribute to this innovative research by evaluating the performance of a multimodal behavioral biometric based user authentication scheme using touch dynamics and phone movement. This study uses a fusion of two popular publicly available datasets the Hand Movement Orientation and Grasp dataset and the BioIdent dataset. This study evaluates our model performance using three common machine learning algorithms which are Random Forest Support Vector Machine and K-Nearest Neighbor reaching accuracy rates as high as 82% with each algorithm performing respectively for all success metrics reported.
    Improved Random Features for Dot Product Kernels. (arXiv:2201.08712v1 [stat.ML])
    (2 min) Dot product kernels, such as polynomial and exponential (softmax) kernels, are among the most widely used kernels in machine learning, as they enable modeling the interactions between input features, which is crucial in applications like computer vision, natural language processing, and recommender systems. We make several novel contributions for improving the efficiency of random feature approximations for dot product kernels, to make these kernels more useful in large scale learning. First, we present a generalization of existing random feature approximations for polynomial kernels, such as Rademacher and Gaussian sketches and TensorSRHT, using complex-valued random features. We show empirically that the use of complex features can significantly reduce the variances of these approximations. Second, we provide a theoretical analysis for understanding the factors affecting the efficiency of various random feature approximations, by deriving closed-form expressions for their variances. These variance formulas elucidate conditions under which certain approximations (e.g., TensorSRHT) achieve lower variances than others (e.g, Rademacher sketch), and conditions under which the use of complex features leads to lower variances than real features. Third, by using these variance formulas, which can be evaluated in practice, we develop a data-driven optimization approach to random feature approximations for general dot product kernels, which is also applicable to the Gaussian kernel. We describe the improvements brought by these contributions with extensive experiments on a variety of tasks and datasets.
    Fast Differentiable Matrix Square Root. (arXiv:2201.08663v1 [cs.CV])
    (2 min) Computing the matrix square root or its inverse in a differentiable manner is important in a variety of computer vision tasks. Previous methods either adopt the Singular Value Decomposition (SVD) to explicitly factorize the matrix or use the Newton-Schulz iteration (NS iteration) to derive the approximate solution. However, both methods are not computationally efficient enough in either the forward pass or in the backward pass. In this paper, we propose two more efficient variants to compute the differentiable matrix square root. For the forward propagation, one method is to use Matrix Taylor Polynomial (MTP), and the other method is to use Matrix Pad\'e Approximants (MPA). The backward gradient is computed by iteratively solving the continuous-time Lyapunov equation using the matrix sign function. Both methods yield considerable speed-up compared with the SVD or the Newton-Schulz iteration. Experimental results on the de-correlated batch normalization and second-order vision transformer demonstrate that our methods can also achieve competitive and even slightly better performances. The code is available at \href{https://github.com/KingJamesSong/FastDifferentiableMatSqrt}{https://github.com/KingJamesSong/FastDifferentiableMatSqrt}.
    Identifying Adversarial Attacks on Text Classifiers. (arXiv:2201.08555v1 [cs.CL])
    (2 min) The landscape of adversarial attacks against text classifiers continues to grow, with new attacks developed every year and many of them available in standard toolkits, such as TextAttack and OpenAttack. In response, there is a growing body of work on robust learning, which reduces vulnerability to these attacks, though sometimes at a high cost in compute time or accuracy. In this paper, we take an alternate approach -- we attempt to understand the attacker by analyzing adversarial text to determine which methods were used to create it. Our first contribution is an extensive dataset for attack detection and labeling: 1.5~million attack instances, generated by twelve adversarial attacks targeting three classifiers trained on six source datasets for sentiment analysis and abuse detection in English. As our second contribution, we use this dataset to develop and benchmark a number of classifiers for attack identification -- determining if a given text has been adversarially manipulated and by which attack. As a third contribution, we demonstrate the effectiveness of three classes of features for these tasks: text properties, capturing content and presentation of text; language model properties, determining which tokens are more or less probable throughout the input; and target model properties, representing how the text classifier is influenced by the attack, including internal node activations. Overall, this represents a first step towards forensics for adversarial attacks against text classifiers.
    A phase transition for finding needles in nonlinear haystacks with LASSO artificial neural networks. (arXiv:2201.08652v1 [stat.ML])
    (2 min) To fit sparse linear associations, a LASSO sparsity inducing penalty with a single hyperparameter provably allows to recover the important features (needles) with high probability in certain regimes even if the sample size is smaller than the dimension of the input vector (haystack). More recently learners known as artificial neural networks (ANN) have shown great successes in many machine learning tasks, in particular fitting nonlinear associations. Small learning rate, stochastic gradient descent algorithm and large training set help to cope with the explosion in the number of parameters present in deep neural networks. Yet few ANN learners have been developed and studied to find needles in nonlinear haystacks. Driven by a single hyperparameter, our ANN learner, like for sparse linear associations, exhibits a phase transition in the probability of retrieving the needles, which we do not observe with other ANN learners. To select our penalty parameter, we generalize the universal threshold of Donoho and Johnstone (1994) which is a better rule than the conservative (too many false detections) and expensive cross-validation. In the spirit of simulated annealing, we propose a warm-start sparsity inducing algorithm to solve the high-dimensional, non-convex and non-differentiable optimization problem. We perform precise Monte Carlo simulations to show the effectiveness of our approach.
    Individual Treatment Effect Estimation Through Controlled Neural Network Training in Two Stages. (arXiv:2201.08559v1 [cs.LG])
    (2 min) We develop a Causal-Deep Neural Network (CDNN) model trained in two stages to infer causal impact estimates at an individual unit level. Using only the pre-treatment features in stage 1 in the absence of any treatment information, we learn an encoding for the covariates that best represents the outcome. In the $2^{nd}$ stage we further seek to predict the unexplained outcome from stage 1, by introducing the treatment indicator variables alongside the encoded covariates. We prove that even without explicitly computing the treatment residual, our method still satisfies the desirable local Neyman orthogonality, making it robust to small perturbations in the nuisance parameters. Furthermore, by establishing connections with the representation learning approaches, we create a framework from which multiple variants of our algorithm can be derived. We perform initial experiments on the publicly available data sets to compare these variants and get guidance in selecting the best variant of our CDNN method. On evaluating CDNN against the state-of-the-art approaches on three benchmarking datasets, we observe that CDNN is highly competitive and often yields the most accurate individual treatment effect estimates. We highlight the strong merits of CDNN in terms of its extensibility to multiple use cases.
    The Security of Deep Learning Defences for Medical Imaging. (arXiv:2201.08661v1 [cs.CR])
    (2 min) Deep learning has shown great promise in the domain of medical image analysis. Medical professionals and healthcare providers have been adopting the technology to speed up and enhance their work. These systems use deep neural networks (DNN) which are vulnerable to adversarial samples; images with imperceivable changes that can alter the model's prediction. Researchers have proposed defences which either make a DNN more robust or detect the adversarial samples before they do harm. However, none of these works consider an informed attacker which can adapt to the defence mechanism. We show that an informed attacker can evade five of the current state of the art defences while successfully fooling the victim's deep learning model, rendering these defences useless. We then suggest better alternatives for securing healthcare DNNs from such attacks: (1) harden the system's security and (2) use digital signatures.
    How does unlabeled data improve generalization in self-training? A one-hidden-layer theoretical analysis. (arXiv:2201.08514v1 [cs.LG])
    (2 min) Self-training, a semi-supervised learning algorithm, leverages a large amount of unlabeled data to improve learning when the labeled data are limited. Despite empirical successes, its theoretical characterization remains elusive. To the best of our knowledge, this work establishes the first theoretical analysis for the known iterative self-training paradigm and proves the benefits of unlabeled data in both training convergence and generalization ability. To make our theoretical analysis feasible, we focus on the case of one-hidden-layer neural networks. However, theoretical understanding of iterative self-training is non-trivial even for a shallow neural network. One of the key challenges is that existing neural network landscape analysis built upon supervised learning no longer holds in the (semi-supervised) self-training paradigm. We address this challenge and prove that iterative self-training converges linearly with both convergence rate and generalization accuracy improved in the order of $1/\sqrt{M}$, where $M$ is the number of unlabeled samples. Experiments from shallow neural networks to deep neural networks are also provided to justify the correctness of our established theoretical insights on self-training.
    On the adaptation of recurrent neural networks for system identification. (arXiv:2201.08660v1 [cs.LG])
    (2 min) This paper presents a transfer learning approach which enables fast and efficient adaptation of Recurrent Neural Network (RNN) models of dynamical systems. A nominal RNN model is first identified using available measurements. The system dynamics are then assumed to change, leading to an unacceptable degradation of the nominal model performance on the perturbed system. To cope with the mismatch, the model is augmented with an additive correction term trained on fresh data from the new dynamic regime. The correction term is learned through a Jacobian Feature Regression (JFR) method defined in terms of the features spanned by the model's Jacobian with respect to its nominal parameters. A non-parametric view of the approach is also proposed, which extends recent work on Gaussian Process (GP) with Neural Tangent Kernel (NTK-GP) to the RNN case (RNTK-GP). This can be more efficient for very large networks or when only few data points are available. Implementation aspects for fast and efficient computation of the correction term, as well as the initial state estimation for the RNN model are described. Numerical examples show the effectiveness of the proposed methodology in presence of significant system variations.
    TOFU: Towards Obfuscated Federated Updates by Encoding Weight Updates into Gradients from Proxy Data. (arXiv:2201.08494v1 [cs.LG])
    (2 min) Advances in Federated Learning and an abundance of user data have enabled rich collaborative learning between multiple clients, without sharing user data. This is done via a central server that aggregates learning in the form of weight updates. However, this comes at the cost of repeated expensive communication between the clients and the server, and concerns about compromised user privacy. The inversion of gradients into the data that generated them is termed data leakage. Encryption techniques can be used to counter this leakage, but at added expense. To address these challenges of communication efficiency and privacy, we propose TOFU, a novel algorithm which generates proxy data that encodes the weight updates for each client in its gradients. Instead of weight updates, this proxy data is now shared. Since input data is far lower in dimensional complexity than weights, this encoding allows us to send much lesser data per communication round. Additionally, the proxy data resembles noise, and even perfect reconstruction from data leakage attacks would invert the decoded gradients into unrecognizable noise, enhancing privacy. We show that TOFU enables learning with less than 1% and 7% accuracy drops on MNIST and on CIFAR-10 datasets, respectively. This drop can be recovered via a few rounds of expensive encrypted gradient exchange. This enables us to learn to near-full accuracy in a federated setup, while being 4x and 6.6x more communication efficient than the standard Federated Averaging algorithm on MNIST and CIFAR-10, respectively.
    Distance-Ratio-Based Formulation for Metric Learning. (arXiv:2201.08676v1 [cs.LG])
    (2 min) In metric learning, the goal is to learn an embedding so that data points with the same class are close to each other and data points with different classes are far apart. We propose a distance-ratio-based (DR) formulation for metric learning. Like softmax-based formulation for metric learning, it models $p(y=c|x')$, which is a probability that a query point $x'$ belongs to a class $c$. The DR formulation has two useful properties. First, the corresponding loss is not affected by scale changes of an embedding. Second, it outputs the optimal (maximum or minimum) classification confidence scores on representing points for classes. To demonstrate the effectiveness of our formulation, we conduct few-shot classification experiments using softmax-based and DR formulations on CUB and mini-ImageNet datasets. The results show that DR formulation generally enables faster and more stable metric learning than the softmax-based formulation. As a result, using DR formulation achieves improved or comparable generalization performances.
    Fair Node Representation Learning via Adaptive Data Augmentation. (arXiv:2201.08549v1 [cs.LG])
    (2 min) Node representation learning has demonstrated its efficacy for various applications on graphs, which leads to increasing attention towards the area. However, fairness is a largely under-explored territory within the field, which may lead to biased results towards underrepresented groups in ensuing tasks. To this end, this work theoretically explains the sources of bias in node representations obtained via Graph Neural Networks (GNNs). Our analysis reveals that both nodal features and graph structure lead to bias in the obtained representations. Building upon the analysis, fairness-aware data augmentation frameworks on nodal features and graph structure are developed to reduce the intrinsic bias. Our analysis and proposed schemes can be readily employed to enhance the fairness of various GNN-based learning mechanisms. Extensive experiments on node classification and link prediction are carried out over real networks in the context of graph contrastive learning. Comparison with multiple benchmarks demonstrates that the proposed augmentation strategies can improve fairness in terms of statistical parity and equal opportunity, while providing comparable utility to state-of-the-art contrastive methods.
    Robust Unsupervised Graph Representation Learning via Mutual Information Maximization. (arXiv:2201.08557v1 [cs.LG])
    (2 min) Recent studies have shown that GNNs are vulnerable to adversarial attack. Thus, many approaches are proposed to improve the robustness of GNNs against adversarial attacks. Nevertheless, most of these methods measure the model robustness based on label information and thus become infeasible when labels information is not available. Therefore, this paper focuses on robust unsupervised graph representation learning. In particular, to quantify the robustness of GNNs without label information, we propose a robustness measure, named graph representation robustness (GRR), to evaluate the mutual information between adversarially perturbed node representations and the original graph. There are mainly two challenges to estimate GRR: 1) mutual information estimation upon adversarially attacked graphs; 2) high complexity of adversarial attack to perturb node features and graph structure jointly in the training procedure. To tackle these problems, we further propose an effective mutual information estimator with subgraph-level summary and an efficient adversarial training strategy with only feature perturbations. Moreover, we theoretically establish a connection between our proposed GRR measure and the robustness of downstream classifiers, which reveals that GRR can provide a lower bound to the adversarial risk of downstream classifiers. Extensive experiments over several benchmarks demonstrate the effectiveness and superiority of our proposed method.
    Training Hybrid Classical-Quantum Classifiers via Stochastic Variational Optimization. (arXiv:2201.08629v1 [quant-ph])
    (2 min) Quantum machine learning has emerged as a potential practical application of near-term quantum devices. In this work, we study a two-layer hybrid classical-quantum classifier in which a first layer of quantum stochastic neurons implementing generalized linear models (QGLMs) is followed by a second classical combining layer. The input to the first, hidden, layer is obtained via amplitude encoding in order to leverage the exponential size of the fan-in of the quantum neurons in the number of qubits per neuron. To facilitate implementation of the QGLMs, all weights and activations are binary. While the state of the art on training strategies for this class of models is limited to exhaustive search and single-neuron perceptron-like bit-flip strategies, this letter introduces a stochastic variational optimization approach that enables the joint training of quantum and classical layers via stochastic gradient descent. Experiments show the advantages of the approach for a variety of activation functions implemented by QGLM neurons.
    Deep Learning-Accelerated 3D Carbon Storage Reservoir Pressure Forecasting Based on Data Assimilation Using Surface Displacement from InSAR. (arXiv:2201.08543v1 [stat.ML])
    (2 min) Fast forecasting of reservoir pressure distribution in geologic carbon storage (GCS) by assimilating monitoring data is a challenging problem. Due to high drilling cost, GCS projects usually have spatially sparse measurements from wells, leading to high uncertainties in reservoir pressure prediction. To address this challenge, we propose to use low-cost Interferometric Synthetic-Aperture Radar (InSAR) data as monitoring data to infer reservoir pressure build up. We develop a deep learning-accelerated workflow to assimilate surface displacement maps interpreted from InSAR and to forecast dynamic reservoir pressure. Employing an Ensemble Smoother Multiple Data Assimilation (ES-MDA) framework, the workflow updates three-dimensional (3D) geologic properties and predicts reservoir pressure with quantified uncertainties. We use a synthetic commercial-scale GCS model with bimodally distributed permeability and porosity to demonstrate the efficacy of the workflow. A two-step CNN-PCA approach is employed to parameterize the bimodal fields. The computational efficiency of the workflow is boosted by two residual U-Net based surrogate models for surface displacement and reservoir pressure predictions, respectively. The workflow can complete data assimilation and reservoir pressure forecasting in half an hour on a personal computer.
    Equivalent Distance Geometry Error for Molecular Conformation Comparison. (arXiv:2201.08714v1 [q-bio.BM])
    (2 min) Straight-forward conformation generation models, which generate 3-D structures directly from input molecular graphs, play an important role in various molecular tasks with machine learning, such as 3D-QSAR and virtual screening in drug design. However, existing loss functions in these models either cost overmuch time or fail to guarantee the equivalence during optimization, which means treating different items unfairly, resulting in poor local geometry in generated conformation. So, we propose Equivalent Distance Geometry Error (EDGE) to calculate the differential discrepancy between conformations where the essential factors of three kinds in conformation geometry (i.e. bond lengths, bond angles and dihedral angles) are equivalently optimized with certain weights. And in the improved version of our method, the optimization features minimizing linear transformations of atom-pair distances within 3-hop. Extensive experiments show that, compared with existing loss functions, EDGE performs effectively and efficiently in two tasks under the same backbones.
    LRSVRG-IMC: An SVRG-Based Algorithm for LowRank Inductive Matrix Completion. (arXiv:2201.08516v1 [cs.LG])
    (2 min) Low-rank inductive matrix completion (IMC) is currently widely used in IoT data completion, recommendation systems, and so on, as the side information in IMC has demonstrated great potential in reducing sample point remains a major obstacle for the convergence of the nonconvex solutions to IMC. What's more, carefully choosing the initial solution alone does not usually help remove the saddle points. To address this problem, we propose a stocastic variance reduction gradient-based algorithm called LRSVRG-IMC. LRSVRG-IMC can escape from the saddle points under various low-rank and sparse conditions with a properly chosen initial input. We also prove that LRSVVRG-IMC achieves both a linear convergence rate and a near-optimal sample complexity. The superiority and applicability of LRSVRG-IMC are verified via experiments on synthetic datasets.
    A fuzzy-rough uncertainty measure to discover bias encoded explicitly or implicitly in features of structured pattern classification datasets. (arXiv:2108.09098v2 [cs.LG] UPDATED)
    (0 min) The need to measure bias encoded in tabular data that are used to solve pattern recognition problems is widely recognized by academia, legislators and enterprises alike. In previous work, we proposed a bias quantification measure, called fuzzy-rough uncer-tainty, which relies on the fuzzy-rough set theory. The intuition dictates that protected features should not change the fuzzy-rough boundary regions of a decision class significantly. The extent to which this happens is a proxy for bias expressed as uncertainty in adecision-making context. Our measure's main advantage is that it does not depend on any machine learning prediction model but adistance function. In this paper, we extend our study by exploring the existence of bias encoded implicitly in non-protected featuresas defined by the correlation between protected and unprotected attributes. This analysis leads to four scenarios that domain experts should evaluate before deciding how to tackle bias. In addition, we conduct a sensitivity analysis to determine the fuzzy operatorsand distance function that best capture change in the boundary regions.
    Enhancing Hyperbolic Graph Embeddings via Contrastive Learning. (arXiv:2201.08554v1 [cs.LG])
    (2 min) Recently, hyperbolic space has risen as a promising alternative for semi-supervised graph representation learning. Many efforts have been made to design hyperbolic versions of neural network operations. However, the inspiring geometric properties of this unique geometry have not been fully explored yet. The potency of graph models powered by the hyperbolic space is still largely underestimated. Besides, the rich information carried by abundant unlabelled samples is also not well utilized. Inspired by the recently active and emerging self-supervised learning, in this study, we attempt to enhance the representation power of hyperbolic graph models by drawing upon the advantages of contrastive learning. More specifically, we put forward a novel Hyperbolic Graph Contrastive Learning (HGCL) framework which learns node representations through multiple hyperbolic spaces to implicitly capture the hierarchical structure shared between different views. Then, we design a hyperbolic position consistency (HPC) constraint based on hyperbolic distance and the homophily assumption to make contrastive learning fit into hyperbolic space. Experimental results on multiple real-world datasets demonstrate the superiority of the proposed HGCL as it consistently outperforms competing methods by considerable margins for the node classification task.
    RoboMal: Malware Detection for Robot Network Systems. (arXiv:2201.08470v1 [cs.RO])
    (2 min) Robot systems are increasingly integrating into numerous avenues of modern life. From cleaning houses to providing guidance and emotional support, robots now work directly with humans. Due to their far-reaching applications and progressively complex architecture, they are being targeted by adversarial attacks such as sensor-actuator attacks, data spoofing, malware, and network intrusion. Therefore, security for robotic systems has become crucial. In this paper, we address the underserved area of malware detection in robotic software. Since robots work in close proximity to humans, often with direct interactions, malware could have life-threatening impacts. Hence, we propose the RoboMal framework of static malware detection on binary executables to detect malware before it gets a chance to execute. Additionally, we address the great paucity of data in this space by providing the RoboMal dataset comprising controller executables of a small-scale autonomous car. The performance of the framework is compared against widely used supervised learning models: GRU, CNN, and ANN. Notably, the LSTM-based RoboMal model outperforms the other models with an accuracy of 85% and precision of 87% in 10-fold cross-validation, hence proving the effectiveness of the proposed framework.
    Post-Training Detection of Backdoor Attacks for Two-Class and Multi-Attack Scenarios. (arXiv:2201.08474v1 [cs.CR])
    (2 min) Backdoor attacks (BAs) are an emerging threat to deep neural network classifiers. A victim classifier will predict to an attacker-desired target class whenever a test sample is embedded with the same backdoor pattern (BP) that was used to poison the classifier's training set. Detecting whether a classifier is backdoor attacked is not easy in practice, especially when the defender is, e.g., a downstream user without access to the classifier's training set. This challenge is addressed here by a reverse-engineering defense (RED), which has been shown to yield state-of-the-art performance in several domains. However, existing REDs are not applicable when there are only {\it two classes} or when {\it multiple attacks} are present. These scenarios are first studied in the current paper, under the practical constraints that the defender neither has access to the classifier's training set nor to supervision from clean reference classifiers trained for the same domain. We propose a detection framework based on BP reverse-engineering and a novel {\it expected transferability} (ET) statistic. We show that our ET statistic is effective {\it using the same detection threshold}, irrespective of the classification domain, the attack configuration, and the BP reverse-engineering algorithm that is used. The excellent performance of our method is demonstrated on six benchmark datasets. Notably, our detection framework is also applicable to multi-class scenarios with multiple attacks.
    Android Malware Detection using Feature Ranking of Permissions. (arXiv:2201.08468v1 [cs.CR])
    (2 min) We investigate the use of Android permissions as the vehicle to allow for quick and effective differentiation between benign and malware apps. To this end, we extract all Android permissions, eliminating those that have zero impact, and apply two feature ranking algorithms namely Chi-Square test and Fisher's Exact test to rank and additionally filter them, resulting in a comparatively small set of relevant permissions. Then we use Decision Tree, Support Vector Machine, and Random Forest Classifier algorithms to detect malware apps. Our analysis indicates that this approach can result in better accuracy and F-score value than other reported approaches. In particular, when random forest is used as the classifier with the combination of Fisher's Exact test, we achieve 99.34\% in accuracy and 92.17\% in F-score with the false positive rate of 0.56\% for the dataset in question, with results improving to 99.82\% in accuracy and 95.28\% in F-score with the false positive rate as low as 0.05\% when only malware from three most popular malware families are considered.
    Deep Attention-Based Supernovae Classification of Multi-Band Light-Curves. (arXiv:2201.08482v1 [astro-ph.IM])
    (2 min) In astronomical surveys, such as the Zwicky Transient Facility (ZTF), supernovae (SNe) are relatively uncommon objects compared to other classes of variable events. Along with this scarcity, the processing of multi-band light-curves is a challenging task due to the highly irregular cadence, long time gaps, missing-values, low number of observations, etc. These issues are particularly detrimental for the analysis of transient events with SN-like light-curves. In this work, we offer three main contributions. First, based on temporal modulation and attention mechanisms, we propose a Deep Attention model called TimeModAttn to classify multi-band light-curves of different SN types, avoiding photometric or hand-crafted feature computations, missing-values assumptions, and explicit imputation and interpolation methods. Second, we propose a model for the synthetic generation of SN multi-band light-curves based on the Supernova Parametric Model (SPM). This allows us to increase the number of samples and the diversity of the cadence. The TimeModAttn model is first pre-trained using synthetic light-curves in a semi-supervised learning scheme. Then, a fine-tuning process is performed for domain adaptation. The proposed TimeModAttn model outperformed a Random Forest classifier, increasing the balanced-$F_1$score from $\approx.525$ to $\approx.596$. The TimeModAttn model also outperformed other Deep Learning models, based on Recurrent Neural Networks (RNNs), in two scenarios: late-classification and early-classification. Finally, we conduct interpretability experiments. High attention scores are obtained for observations earlier than and close to the SN brightness peaks, which are supported by an early and highly expressive learned temporal modulation.
    DDPG-Driven Deep-Unfolding with Adaptive Depth for Channel Estimation with Sparse Bayesian Learning. (arXiv:2201.08477v1 [eess.SP])
    (2 min) Deep-unfolding neural networks (NNs) have received great attention since they achieve satisfactory performance with relatively low complexity. Typically, these deep-unfolding NNs are restricted to a fixed-depth for all inputs. However, the optimal number of layers required for convergence changes with different inputs. In this paper, we first develop a framework of deep deterministic policy gradient (DDPG)-driven deep-unfolding with adaptive depth for different inputs, where the trainable parameters of deep-unfolding NN are learned by DDPG, rather than updated by the stochastic gradient descent algorithm directly. Specifically, the optimization variables, trainable parameters, and architecture of deep-unfolding NN are designed as the state, action, and state transition of DDPG, respectively. Then, this framework is employed to deal with the channel estimation problem in massive multiple-input multiple-output systems. Specifically, first of all we formulate the channel estimation problem with an off-grid basis and develop a sparse Bayesian learning (SBL)-based algorithm to solve it. Secondly, the SBL-based algorithm is unfolded into a layer-wise structure with a set of introduced trainable parameters. Thirdly, the proposed DDPG-driven deep-unfolding framework is employed to solve this channel estimation problem based on the unfolded structure of the SBL-based algorithm. To realize adaptive depth, we design the halting score to indicate when to stop, which is a function of the channel reconstruction error. Furthermore, the proposed framework is extended to realize the adaptive depth of the general deep neural networks (DNNs). Simulation results show that the proposed algorithm outperforms the conventional optimization algorithms and DNNs with fixed depth with much reduced number of layers.
    alpha-Deep Probabilistic Inference (alpha-DPI): efficient uncertainty quantification from exoplanet astrometry to black hole feature extraction. (arXiv:2201.08506v1 [astro-ph.IM])
    (2 min) Inference is crucial in modern astronomical research, where hidden astrophysical features and patterns are often estimated from indirect and noisy measurements. Inferring the posterior of hidden features, conditioned on the observed measurements, is essential for understanding the uncertainty of results and downstream scientific interpretations. Traditional approaches for posterior estimation include sampling-based methods and variational inference. However, sampling-based methods are typically slow for high-dimensional inverse problems, while variational inference often lacks estimation accuracy. In this paper, we propose alpha-DPI, a deep learning framework that first learns an approximate posterior using alpha-divergence variational inference paired with a generative neural network, and then produces more accurate posterior samples through importance re-weighting of the network samples. It inherits strengths from both sampling and variational inference methods: it is fast, accurate, and scalable to high-dimensional problems. We apply our approach to two high-impact astronomical inference problems using real data: exoplanet astrometry and black hole feature extraction.
    Deep reinforcement learning under signal temporal logic constraints using Lagrangian relaxation. (arXiv:2201.08504v1 [stat.ML])
    (2 min) Deep reinforcement learning (DRL) has attracted much attention as an approach to solve sequential decision making problems without mathematical models of systems or environments. In general, a constraint may be imposed on the decision making. In this study, we consider the optimal decision making problems with constraints to complete temporal high-level tasks in the continuous state-action domain. We describe the constraints using signal temporal logic (STL), which is useful for time sensitive control tasks since it can specify continuous signals within a bounded time interval. To deal with the STL constraints, we introduce an extended constrained Markov decision process (CMDP), which is called a $\tau$-CMDP. We formulate the STL constrained optimal decision making problem as the $\tau$-CMDP and propose a two-phase constrained DRL algorithm using the Lagrangian relaxation method. Through simulations, we also demonstrate the learning performance of the proposed algorithm.
    GenGNN: A Generic FPGA Framework for Graph Neural Network Acceleration. (arXiv:2201.08475v1 [cs.LG])
    (2 min) Graph neural networks (GNNs) have recently exploded in popularity thanks to their broad applicability to ubiquitous graph-related problems such as quantum chemistry, drug discovery, and high energy physics. However, meeting demand for novel GNN models and fast inference simultaneously is challenging because of the gap between the difficulty in developing efficient FPGA accelerators and the rapid pace of creation of new GNN models. Prior art focuses on the acceleration of specific classes of GNNs but lacks the generality to work across existing models or to extend to new and emerging GNN models. In this work, we propose a generic GNN acceleration framework using High-Level Synthesis (HLS), named GenGNN, with two-fold goals. First, we aim to deliver ultra-fast GNN inference without any graph pre-processing for real-time requirements. Second, we aim to support a diverse set of GNN models with the extensibility to flexibly adapt to new models. The framework features an optimized message-passing structure applicable to all models, combined with a rich library of model-specific components. We verify our implementation on-board on the Xilinx Alveo U50 FPGA and observe a speed-up of up to 25x against CPU (6226R) baseline and 13x against GPU (A6000) baseline. Our HLS code will be open-source on GitHub upon acceptance.
    Iterated Reasoning with Mutual Information in Cooperative and Byzantine Decentralized Teaming. (arXiv:2201.08484v1 [cs.MA])
    (2 min) Information sharing is key in building team cognition and enables coordination and cooperation. High-performing human teams also benefit from acting strategically with hierarchical levels of iterated communication and rationalizability, meaning a human agent can reason about the actions of their teammates in their decision-making. Yet, the majority of prior work in Multi-Agent Reinforcement Learning (MARL) does not support iterated rationalizability and only encourage inter-agent communication, resulting in a suboptimal equilibrium cooperation strategy. In this work, we show that reformulating an agent's policy to be conditional on the policies of its neighboring teammates inherently maximizes Mutual Information (MI) lower-bound when optimizing under Policy Gradient (PG). Building on the idea of decision-making under bounded rationality and cognitive hierarchy theory, we show that our modified PG approach not only maximizes local agent rewards but also implicitly reasons about MI between agents without the need for any explicit ad-hoc regularization terms. Our approach, InfoPG, outperforms baselines in learning emergent collaborative behaviors and sets the state-of-the-art in decentralized cooperative MARL tasks. Our experiments validate the utility of InfoPG by achieving higher sample efficiency and significantly larger cumulative reward in several complex cooperative multi-agent domains.
    Federated Learning with Heterogeneous Architectures using Graph HyperNetworks. (arXiv:2201.08459v1 [cs.LG])
    (2 min) Standard Federated Learning (FL) techniques are limited to clients with identical network architectures. This restricts potential use-cases like cross-platform training or inter-organizational collaboration when both data privacy and architectural proprietary are required. We propose a new FL framework that accommodates heterogeneous client architecture by adopting a graph hypernetwork for parameter sharing. A property of the graph hyper network is that it can adapt to various computational graphs, thereby allowing meaningful parameter sharing across models. Unlike existing solutions, our framework does not limit the clients to share the same architecture type, makes no use of external data and does not require clients to disclose their model architecture. Compared with distillation-based and non-graph hypernetwork baselines, our method performs notably better on standard benchmarks. We additionally show encouraging generalization performance to unseen architectures.
    An Empirical Investigation of Model-to-Model Distribution Shifts in Trained Convolutional Filters. (arXiv:2201.08465v1 [cs.CV])
    (2 min) We present first empirical results from our ongoing investigation of distribution shifts in image data used for various computer vision tasks. Instead of analyzing the original training and test data, we propose to study shifts in the learned weights of trained models. In this work, we focus on the properties of the distributions of dominantly used 3x3 convolution filter kernels. We collected and publicly provide a data set with over half a billion filters from hundreds of trained CNNs, using a wide range of data sets, architectures, and vision tasks. Our analysis shows interesting distribution shifts (or the lack thereof) between trained filters along different axes of meta-parameters, like data type, task, architecture, or layer depth. We argue, that the observed properties are a valuable source for further investigation into a better understanding of the impact of shifts in the input data to the generalization abilities of CNN models and novel methods for more robust transfer-learning in this domain. Data available at: https://github.com/paulgavrikov/CNN-Filter-DB/.
    Hybrid Graph Models for Logic Optimization via Spatio-Temporal Information. (arXiv:2201.08455v1 [cs.LG])
    (2 min) Despite the stride made by machine learning (ML) based performance modeling, two major concerns that may impede production-ready ML applications in EDA are stringent accuracy requirements and generalization capability. To this end, we propose hybrid graph neural network (GNN) based approaches towards highly accurate quality-of-result (QoR) estimations with great generalization capability, specifically targeting logic synthesis optimization. The key idea is to simultaneously leverage spatio-temporal information from hardware designs and logic synthesis flows to forecast performance (i.e., delay/area) of various synthesis flows on different designs. The structural characteristics inside hardware designs are distilled and represented by GNNs; the temporal knowledge (i.e., relative ordering of logic transformations) in synthesis flows can be imposed on hardware designs by combining a virtually added supernode or a sequence processing model with conventional GNN models. Evaluation on 3.3 million data points shows that the testing mean absolute percentage error (MAPE) on designs seen and unseen during training are no more than 1.2% and 3.1%, respectively, which are 7-15X lower than existing studies.
    Kinit Classification in Ethiopian Chants, Azmaris and Modern Music: A New Dataset and CNN Benchmark. (arXiv:2201.08448v1 [cs.SD])
    (2 min) In this paper, we create EMIR, the first-ever Music Information Retrieval dataset for Ethiopian music. EMIR is freely available for research purposes and contains 600 sample recordings of Orthodox Tewahedo chants, traditional Azmari songs and contemporary Ethiopian secular music. Each sample is classified by five expert judges into one of four well-known Ethiopian Kinits, Tizita, Bati, Ambassel and Anchihoye. Each Kinit uses its own pentatonic scale and also has its own stylistic characteristics. Thus, Kinit classification needs to combine scale identification with genre recognition. After describing the dataset, we present the Ethio Kinits Model (EKM), based on VGG, for classifying the EMIR clips. In Experiment 1, we investigated whether Filterbank, Mel-spectrogram, Chroma, or Mel-frequency Cepstral coefficient (MFCC) features work best for Kinit classification using EKM. MFCC was found to be superior and was therefore adopted for Experiment 2, where the performance of EKM models using MFCC was compared using three different audio sample lengths. 3s length gave the best results. In Experiment 3, EKM and four existing models were compared on the EMIR dataset: AlexNet, ResNet50, VGG16 and LSTM. EKM was found to have the best accuracy (95.00%) as well as the fastest training time. We hope this work will encourage others to explore Ethiopian music and to experiment with other models for Kinit classification.
    Neural Network Quantization with AI Model Efficiency Toolkit (AIMET). (arXiv:2201.08442v1 [cs.LG])
    (2 min) While neural networks have advanced the frontiers in many machine learning applications, they often come at a high computational cost. Reducing the power and latency of neural network inference is vital to integrating modern networks into edge devices with strict power and compute requirements. Neural network quantization is one of the most effective ways of achieving these savings, but the additional noise it induces can lead to accuracy degradation. In this white paper, we present an overview of neural network quantization using AI Model Efficiency Toolkit (AIMET). AIMET is a library of state-of-the-art quantization and compression algorithms designed to ease the effort required for model optimization and thus drive the broader AI ecosystem towards low latency and energy-efficient inference. AIMET provides users with the ability to simulate as well as optimize PyTorch and TensorFlow models. Specifically for quantization, AIMET includes various post-training quantization (PTQ, cf. chapter 4) and quantization-aware training (QAT, cf. chapter 5) techniques that guarantee near floating-point accuracy for 8-bit fixed-point inference. We provide a practical guide to quantization via AIMET by covering PTQ and QAT workflows, code examples and practical tips that enable users to efficiently and effectively quantize models using AIMET and reap the benefits of low-bit integer inference.
    Regional Negative Bias in Word Embeddings Predicts Racial Animus--but only via Name Frequency. (arXiv:2201.08451v1 [cs.CL])
    (2 min) The word embedding association test (WEAT) is an important method for measuring linguistic biases against social groups such as ethnic minorities in large text corpora. It does so by comparing the semantic relatedness of words prototypical of the groups (e.g., names unique to those groups) and attribute words (e.g., 'pleasant' and 'unpleasant' words). We show that anti-black WEAT estimates from geo-tagged social media data at the level of metropolitan statistical areas strongly correlate with several measures of racial animus--even when controlling for sociodemographic covariates. However, we also show that every one of these correlations is explained by a third variable: the frequency of Black names in the underlying corpora relative to White names. This occurs because word embeddings tend to group positive (negative) words and frequent (rare) words together in the estimated semantic space. As the frequency of Black names on social media is strongly correlated with Black Americans' prevalence in the population, this results in spurious anti-Black WEAT estimates wherever few Black Americans live. This suggests that research using the WEAT to measure bias should consider term frequency, and also demonstrates the potential consequences of using black-box models like word embeddings to study human cognition and behavior.
    A Prescriptive Dirichlet Power Allocation Policy with Deep Reinforcement Learning. (arXiv:2201.08445v1 [cs.LG])
    (2 min) Prescribing optimal operation based on the condition of the system and, thereby, potentially prolonging the remaining useful lifetime has a large potential for actively managing the availability, maintenance and costs of complex systems. Reinforcement learning (RL) algorithms are particularly suitable for this type of problems given their learning capabilities. A special case of a prescriptive operation is the power allocation task, which can be considered as a sequential allocation problem, where the action space is bounded by a simplex constraint. A general continuous action-space solution of such sequential allocation problems has still remained an open research question for RL algorithms. In continuous action-space, the standard Gaussian policy applied in reinforcement learning does not support simplex constraints, while the Gaussian-softmax policy introduces a bias during training. In this work, we propose the Dirichlet policy for continuous allocation tasks and analyze the bias and variance of its policy gradients. We demonstrate that the Dirichlet policy is bias-free and provides significantly faster convergence, better performance and better hyperparameters robustness over the Gaussian-softmax policy. Moreover, we demonstrate the applicability of the proposed algorithm on a prescriptive operation case, where we propose the Dirichlet power allocation policy and evaluate the performance on a case study of a set of multiple lithium-ion (Li-I) battery systems. The experimental results show the potential to prescribe optimal operation, improve the efficiency and sustainability of multi-power source systems.
    DROPO: Sim-to-Real Transfer with Offline Domain Randomization. (arXiv:2201.08434v1 [cs.RO])
    (2 min) In recent years, domain randomization has gained a lot of traction as a method for sim-to-real transfer of reinforcement learning policies in robotic manipulation; however, finding optimal randomization distributions can be difficult. In this paper, we introduce DROPO, a novel method for estimating domain randomization distributions for safe sim-to-real transfer. Unlike prior work, DROPO only requires a limited, precollected offline dataset of trajectories, and explicitly models parameter uncertainty to match real data. We demonstrate that DROPO is capable of recovering dynamic parameter distributions in simulation and finding a distribution capable of compensating for an unmodelled phenomenon. We also evaluate the method in two zero-shot sim-to-real transfer scenarios, showing successful domain transfer and improved performance over prior methods.
    VUDENC: Vulnerability Detection with Deep Learning on a Natural Codebase for Python. (arXiv:2201.08441v1 [cs.CR])
    (2 min) Context: Identifying potential vulnerable code is important to improve the security of our software systems. However, the manual detection of software vulnerabilities requires expert knowledge and is time-consuming, and must be supported by automated techniques. Objective: Such automated vulnerability detection techniques should achieve a high accuracy, point developers directly to the vulnerable code fragments, scale to real-world software, generalize across the boundaries of a specific software project, and require no or only moderate setup or configuration effort. Method: In this article, we present VUDENC (Vulnerability Detection with Deep Learning on a Natural Codebase), a deep learning-based vulnerability detection tool that automatically learns features of vulnerable code from a large and real-world Python codebase. VUDENC applies a word2vec model to identify semantically similar code tokens and to provide a vector representation. A network of long-short-term memory cells (LSTM) is then used to classify vulnerable code token sequences at a fine-grained level, highlight the specific areas in the source code that are likely to contain vulnerabilities, and provide confidence levels for its predictions. Results: To evaluate VUDENC, we used 1,009 vulnerability-fixing commits from different GitHub repositories that contain seven different types of vulnerabilities (SQL injection, XSS, Command injection, XSRF, Remote code execution, Path disclosure, Open redirect) for training. In the experimental evaluation, VUDENC achieves a recall of 78%-87%, a precision of 82%-96%, and an F1 score of 80%-90%. VUDENC's code, the datasets for the vulnerabilities, and the Python corpus for the word2vec model are available for reproduction. Conclusions: Our experimental results suggest...
    A Visual Analytics Approach to Building Logistic Regression Models and its Application to Health Records. (arXiv:2201.08429v1 [cs.LG])
    (2 min) Multidimensional data analysis has become increasingly important in many fields, mainly due to current vast data availability and the increasing demand to extract knowledge from it. In most applications, the role of the final user is crucial to build proper machine learning models and to explain the patterns found in data. In this paper, we present an open unified approach for generating, evaluating, and applying regression models in high-dimensional data sets within a user-guided process. The approach is based on exposing a broad correlation panorama for attributes, by which the user can select relevant attributes to build and evaluate prediction models for one or more contexts. We name the approach UCReg (User-Centered Regression). We demonstrate effectiveness and efficiency of UCReg through the application of our framework to the analysis of Covid-19 and other synthetic and real health records data.
    Scalable Sampling for Nonsymmetric Determinantal Point Processes. (arXiv:2201.08417v1 [cs.LG])
    (2 min) A determinantal point process (DPP) on a collection of $M$ items is a model, parameterized by a symmetric kernel matrix, that assigns a probability to every subset of those items. Recent work shows that removing the kernel symmetry constraint, yielding nonsymmetric DPPs (NDPPs), can lead to significant predictive performance gains for machine learning applications. However, existing work leaves open the question of scalable NDPP sampling. There is only one known DPP sampling algorithm, based on Cholesky decomposition, that can directly apply to NDPPs as well. Unfortunately, its runtime is cubic in $M$, and thus does not scale to large item collections. In this work, we first note that this algorithm can be transformed into a linear-time one for kernels with low-rank structure. Furthermore, we develop a scalable sublinear-time rejection sampling algorithm by constructing a novel proposal distribution. Additionally, we show that imposing certain structural constraints on the NDPP kernel enables us to bound the rejection rate in a way that depends only on the kernel rank. In our experiments we compare the speed of all of these samplers for a variety of real-world tasks.
    Unicorn: Reasoning about Configurable System Performance through the lens of Causality. (arXiv:2201.08413v1 [cs.LG])
    (2 min) Modern computer systems are highly configurable, with the variability space sometimes larger than the number of atoms in the universe. Understanding and reasoning about the performance behavior of highly configurable systems, due to a vast variability space, is challenging. State-of-the-art methods for performance modeling and analyses rely on predictive machine learning models, therefore, they become (i) unreliable in unseen environments (e.g., different hardware, workloads), and (ii) produce incorrect explanations. To this end, we propose a new method, called Unicorn, which (a) captures intricate interactions between configuration options across the software-hardware stack and (b) describes how such interactions impact performance variations via causal inference. We evaluated Unicorn on six highly configurable systems, including three on-device machine learning systems, a video encoder, a database management system, and a data analytics pipeline. The experimental results indicate that Unicorn outperforms state-of-the-art performance optimization and debugging methods. Furthermore, unlike the existing methods, the learned causal performance models reliably predict performance for new environments.
    Reproducibility in Learning. (arXiv:2201.08430v1 [cs.LG])
    (2 min) We introduce the notion of a reproducible algorithm in the context of learning. A reproducible learning algorithm is resilient to variations in its samples -- with high probability, it returns the exact same output when run on two samples from the same underlying distribution. We begin by unpacking the definition, clarifying how randomness is instrumental in balancing accuracy and reproducibility. We initiate a theory of reproducible algorithms, showing how reproducibility implies desirable properties such as data reuse and efficient testability. Despite the exceedingly strong demand of reproducibility, there are efficient reproducible algorithms for several fundamental problems in statistics and learning. First, we show that any statistical query algorithm can be made reproducible with a modest increase in sample complexity, and we use this to construct reproducible algorithms for finding approximate heavy-hitters and medians. Using these ideas, we give the first reproducible algorithm for learning halfspaces via a reproducible weak learner and a reproducible boosting algorithm. Finally, we initiate the study of lower bounds and inherent tradeoffs for reproducible algorithms, giving nearly tight sample complexity upper and lower bounds for reproducible versus nonreproducible SQ algorithms.
  • stat.ML updates on arXiv.org

    Tuned Regularized Estimators for Linear Regression via Covariance Fitting. (arXiv:2201.08756v1 [math.ST])
    (0 min) We consider the problem of finding tuned regularized parameter estimators for linear models. We start by showing that three known optimal linear estimators belong to a wider class of estimators that can be formulated as a solution to a weighted and constrained minimization problem. The optimal weights, however, are typically unknown in many applications. This begs the question, how should we choose the weights using only the data? We propose using the covariance fitting SPICE-methodology to obtain data-adaptive weights and show that the resulting class of estimators yields tuned versions of known regularized estimators - such as ridge regression, LASSO, and regularized least absolute deviation. These theoretical results unify several important estimators under a common umbrella. The resulting tuned estimators are also shown to be practically relevant by means of a number of numerical examples.
    Gradients are Not All You Need. (arXiv:2111.05803v2 [cs.LG] UPDATED)
    (0 min) Differentiable programming techniques are widely used in the community and are responsible for the machine learning renaissance of the past several decades. While these methods are powerful, they have limits. In this short report, we discuss a common chaos based failure mode which appears in a variety of differentiable circumstances, ranging from recurrent neural networks and numerical physics simulation to training learned optimizers. We trace this failure to the spectrum of the Jacobian of the system under study, and provide criteria for when a practitioner might expect this failure to spoil their differentiation based optimization algorithms.
    A Non-Expert's Introduction to Data Ethics for Mathematicians. (arXiv:2201.07794v1 [math.HO] CROSS LISTED)
    (0 min) I give a short introduction to data ethics. My focal audience is mathematicians, but I hope that my discussion will also be useful to others. I am not an expert about data ethics, and my article is only a starting point. I encourage readers to examine the resources that I discuss and to continue to reflect carefully on data ethics and on the societal implications of data and data analysis throughout their lives.
    A generalization of the randomized singular value decomposition. (arXiv:2105.13052v3 [math.NA] UPDATED)
    (0 min) The randomized singular value decomposition (SVD) is a popular and effective algorithm for computing a near-best rank $k$ approximation of a matrix $A$ using matrix-vector products with standard Gaussian vectors. Here, we generalize the randomized SVD to multivariate Gaussian vectors, allowing one to incorporate prior knowledge of $A$ into the algorithm. This enables us to explore the continuous analogue of the randomized SVD for Hilbert--Schmidt (HS) operators using operator-function products with functions drawn from a Gaussian process (GP). We then construct a new covariance kernel for GPs, based on weighted Jacobi polynomials, which allows us to rapidly sample the GP and control the smoothness of the randomly generated functions. Numerical examples on matrices and HS operators demonstrate the applicability of the algorithm.
    Marginal Effects for Non-Linear Prediction Functions. (arXiv:2201.08837v1 [cs.LG])
    (0 min) Beta coefficients for linear regression models represent the ideal form of an interpretable feature effect. However, for non-linear models and especially generalized linear models, the estimated coefficients cannot be interpreted as a direct feature effect on the predicted outcome. Hence, marginal effects are typically used as approximations for feature effects, either in the shape of derivatives of the prediction function or forward differences in prediction due to a change in a feature value. While marginal effects are commonly used in many scientific fields, they have not yet been adopted as a model-agnostic interpretation method for machine learning models. This may stem from their inflexibility as a univariate feature effect and their inability to deal with the non-linearities found in black box models. We introduce a new class of marginal effects termed forward marginal effects. We argue to abandon derivatives in favor of better-interpretable forward differences. Furthermore, we generalize marginal effects based on forward differences to multivariate changes in feature values. To account for the non-linearity of prediction functions, we introduce a non-linearity measure for marginal effects. We argue against summarizing feature effects of a non-linear prediction function in a single metric such as the average marginal effect. Instead, we propose to partition the feature space to compute conditional average marginal effects on feature subspaces, which serve as conditional feature effect estimates.
    Robust and Fully-Dynamic Coreset for Continuous-and-Bounded Learning (With Outliers) Problems. (arXiv:2107.00068v2 [cs.LG] UPDATED)
    (0 min) In many machine learning tasks, a common approach for dealing with large-scale data is to build a small summary, {\em e.g.,} coreset, that can efficiently represent the original input. However, real-world datasets usually contain outliers and most existing coreset construction methods are not resilient against outliers (in particular, an outlier can be located arbitrarily in the space by an adversarial attacker). In this paper, we propose a novel robust coreset method for the {\em continuous-and-bounded learning} problems (with outliers) which includes a broad range of popular optimization objectives in machine learning, {\em e.g.,} logistic regression and $ k $-means clustering. Moreover, our robust coreset can be efficiently maintained in fully-dynamic environment. To the best of our knowledge, this is the first robust and fully-dynamic coreset construction method for these optimization problems. Another highlight is that our coreset size can depend on the doubling dimension of the parameter space, rather than the VC dimension of the objective function which could be very large or even challenging to compute. Finally, we conduct the experiments on real-world datasets to evaluate the effectiveness of our proposed robust coreset method.
    A phase transition for finding needles in nonlinear haystacks with LASSO artificial neural networks. (arXiv:2201.08652v1 [stat.ML])
    (0 min) To fit sparse linear associations, a LASSO sparsity inducing penalty with a single hyperparameter provably allows to recover the important features (needles) with high probability in certain regimes even if the sample size is smaller than the dimension of the input vector (haystack). More recently learners known as artificial neural networks (ANN) have shown great successes in many machine learning tasks, in particular fitting nonlinear associations. Small learning rate, stochastic gradient descent algorithm and large training set help to cope with the explosion in the number of parameters present in deep neural networks. Yet few ANN learners have been developed and studied to find needles in nonlinear haystacks. Driven by a single hyperparameter, our ANN learner, like for sparse linear associations, exhibits a phase transition in the probability of retrieving the needles, which we do not observe with other ANN learners. To select our penalty parameter, we generalize the universal threshold of Donoho and Johnstone (1994) which is a better rule than the conservative (too many false detections) and expensive cross-validation. In the spirit of simulated annealing, we propose a warm-start sparsity inducing algorithm to solve the high-dimensional, non-convex and non-differentiable optimization problem. We perform precise Monte Carlo simulations to show the effectiveness of our approach.
    Cost-sensitive Boosting Pruning Trees for depression detection on Twitter. (arXiv:1906.00398v3 [cs.LG] UPDATED)
    (0 min) Depression is one of the most common mental health disorders, and a large number of depressed people commit suicide each year. Potential depression sufferers usually do not consult psychological doctors because they feel ashamed or are unaware of any depression, which may result in severe delay of diagnosis and treatment. In the meantime, evidence shows that social media data provides valuable clues about physical and mental health conditions. In this paper, we argue that it is feasible to identify depression at an early stage by mining online social behaviours. Our approach, which is innovative to the practice of depression detection, does not rely on the extraction of numerous or complicated features to achieve accurate depression detection. Instead, we propose a novel classifier, namely, Cost-sensitive Boosting Pruning Trees (CBPT), which demonstrates a strong classification ability on two publicly accessible Twitter depression detection datasets. To comprehensively evaluate the classification capability of the CBPT, we use additional three datasets from the UCI machine learning repository and the CBPT obtains appealing classification results against several state of the arts boosting algorithms. Finally, we comprehensively explore the influence factors of model prediction, and the results manifest that our proposed framework is promising for identifying Twitter users with depression.
    Perturbation Bounds for (Nearly) Orthogonally Decomposable Tensors. (arXiv:2007.09024v2 [math.NA] UPDATED)
    (0 min) We develop deterministic perturbation bounds for singular values and vectors of orthogonally decomposable tensors, in a spirit similar to classical results for matrices such as those due to Weyl, Davis, Kahan and Wedin. Our bounds demonstrate intriguing differences between matrices and higher-order tensors. Most notably, they indicate that for higher-order tensors perturbation affects each essential singular value/vector in isolation, and its effect on an essential singular vector does not depend on the multiplicity of its corresponding singular value or its distance from other singular values. Our results can be readily applied and provide a unified treatment to many different problems in statistics and machine learning involving spectral learning of higher-order orthogonally decomposable tensors. In particular, we illustrate the implications of our bounds in the context of high dimensional tensor SVD problem, and how it can be used to derive optimal rates of convergence for spectral learning.
    Improved Random Features for Dot Product Kernels. (arXiv:2201.08712v1 [stat.ML])
    (0 min) Dot product kernels, such as polynomial and exponential (softmax) kernels, are among the most widely used kernels in machine learning, as they enable modeling the interactions between input features, which is crucial in applications like computer vision, natural language processing, and recommender systems. We make several novel contributions for improving the efficiency of random feature approximations for dot product kernels, to make these kernels more useful in large scale learning. First, we present a generalization of existing random feature approximations for polynomial kernels, such as Rademacher and Gaussian sketches and TensorSRHT, using complex-valued random features. We show empirically that the use of complex features can significantly reduce the variances of these approximations. Second, we provide a theoretical analysis for understanding the factors affecting the efficiency of various random feature approximations, by deriving closed-form expressions for their variances. These variance formulas elucidate conditions under which certain approximations (e.g., TensorSRHT) achieve lower variances than others (e.g, Rademacher sketch), and conditions under which the use of complex features leads to lower variances than real features. Third, by using these variance formulas, which can be evaluated in practice, we develop a data-driven optimization approach to random feature approximations for general dot product kernels, which is also applicable to the Gaussian kernel. We describe the improvements brought by these contributions with extensive experiments on a variety of tasks and datasets.
    Preventing Value Function Collapse in Ensemble {Q}-Learning by Maximizing Representation Diversity. (arXiv:2006.13823v3 [cs.LG] UPDATED)
    (0 min) The classic DQN algorithm is limited by the overestimation bias of the learned Q-function. Subsequent algorithms have proposed techniques to reduce this problem, without fully eliminating it. Recently, the Maxmin and Ensemble Q-learning algorithms have used different estimates provided by the ensembles of learners to reduce the overestimation bias. Unfortunately, these learners can converge to the same point in the parametric or representation space, falling back to the classic single neural network DQN. In this paper, we describe a regularization technique to maximize ensemble diversity in these algorithms. We propose and compare five regularization functions inspired from economics theory and consensus optimization. We show that the regularized approach significantly outperforms the Maxmin and Ensemble Q-learning algorithms as well as non-ensemble baselines.
    Dictionary-based Low-Rank Approximations and the Mixed Sparse Coding problem. (arXiv:2111.12399v2 [cs.LG] UPDATED)
    (0 min) Constrained tensor and matrix factorization models allow to extract interpretable patterns from multiway data. Therefore identifiability properties and efficient algorithms for constrained low-rank approximations are nowadays important research topics. This work deals with columns of factor matrices of a low-rank approximation being sparse in a known and possibly overcomplete basis, a model coined as Dictionary-based Low-Rank Approximation (DLRA). While earlier contributions focused on finding factor columns inside a dictionary of candidate columns, i.e. one-sparse approximations, this work is the first to tackle DLRA with sparsity larger than one. I propose to focus on the sparse-coding subproblem coined Mixed Sparse-Coding (MSC) that emerges when solving DLRA with an alternating optimization strategy. Several algorithms based on sparse-coding heuristics (greedy methods, convex relaxations) are provided to solve MSC. The performance of these heuristics is evaluated on simulated data. Then, I show how to adapt an efficient MSC solver based on the LASSO to compute Dictionary-based Matrix Factorization and Canonical Polyadic Decomposition in the context of hyperspectral image processing and chemometrics. These experiments suggest that DLRA extends the modeling capabilities of low-rank approximations, helps reducing estimation variance and enhances the identifiability and interpretability of estimated factors.
    Scalable Sampling for Nonsymmetric Determinantal Point Processes. (arXiv:2201.08417v1 [cs.LG])
    (0 min) A determinantal point process (DPP) on a collection of $M$ items is a model, parameterized by a symmetric kernel matrix, that assigns a probability to every subset of those items. Recent work shows that removing the kernel symmetry constraint, yielding nonsymmetric DPPs (NDPPs), can lead to significant predictive performance gains for machine learning applications. However, existing work leaves open the question of scalable NDPP sampling. There is only one known DPP sampling algorithm, based on Cholesky decomposition, that can directly apply to NDPPs as well. Unfortunately, its runtime is cubic in $M$, and thus does not scale to large item collections. In this work, we first note that this algorithm can be transformed into a linear-time one for kernels with low-rank structure. Furthermore, we develop a scalable sublinear-time rejection sampling algorithm by constructing a novel proposal distribution. Additionally, we show that imposing certain structural constraints on the NDPP kernel enables us to bound the rejection rate in a way that depends only on the kernel rank. In our experiments we compare the speed of all of these samplers for a variety of real-world tasks.
    Deep reinforcement learning under signal temporal logic constraints using Lagrangian relaxation. (arXiv:2201.08504v1 [stat.ML])
    (0 min) Deep reinforcement learning (DRL) has attracted much attention as an approach to solve sequential decision making problems without mathematical models of systems or environments. In general, a constraint may be imposed on the decision making. In this study, we consider the optimal decision making problems with constraints to complete temporal high-level tasks in the continuous state-action domain. We describe the constraints using signal temporal logic (STL), which is useful for time sensitive control tasks since it can specify continuous signals within a bounded time interval. To deal with the STL constraints, we introduce an extended constrained Markov decision process (CMDP), which is called a $\tau$-CMDP. We formulate the STL constrained optimal decision making problem as the $\tau$-CMDP and propose a two-phase constrained DRL algorithm using the Lagrangian relaxation method. Through simulations, we also demonstrate the learning performance of the proposed algorithm.
    Distance-Ratio-Based Formulation for Metric Learning. (arXiv:2201.08676v1 [cs.LG])
    (0 min) In metric learning, the goal is to learn an embedding so that data points with the same class are close to each other and data points with different classes are far apart. We propose a distance-ratio-based (DR) formulation for metric learning. Like softmax-based formulation for metric learning, it models $p(y=c|x')$, which is a probability that a query point $x'$ belongs to a class $c$. The DR formulation has two useful properties. First, the corresponding loss is not affected by scale changes of an embedding. Second, it outputs the optimal (maximum or minimum) classification confidence scores on representing points for classes. To demonstrate the effectiveness of our formulation, we conduct few-shot classification experiments using softmax-based and DR formulations on CUB and mini-ImageNet datasets. The results show that DR formulation generally enables faster and more stable metric learning than the softmax-based formulation. As a result, using DR formulation achieves improved or comparable generalization performances.
    Individual dynamic prediction of clinical endpoint from large dimensional longitudinal biomarker history: a landmark approach. (arXiv:2102.01466v2 [stat.ML] UPDATED)
    (0 min) The individual data collected throughout patient follow-up constitute crucial information for assessing the risk of a clinical event, and eventually for adapting a therapeutic strategy. Joint models and landmark models have been proposed to compute individual dynamic predictions from repeated measures to one or two markers. However, they hardly extend to the case where the complete patient history includes much more repeated markers possibly. Our objective was thus to propose a solution for the dynamic prediction of a health event that may exploit repeated measures of a possibly large number of markers. We combined a landmark approach extended to endogenous markers history with machine learning methods adapted to survival data. Each marker trajectory is modeled using the information collec…
    Convergence of policy gradient for entropy regularized MDPs with neural network approximation in the mean-field regime. (arXiv:2201.07296v1 [math.OC] CROSS LISTED)
    (0 min) We study the global convergence of policy gradient for infinite-horizon, continuous state and action space, entropy-regularized Markov decision processes (MDPs). We consider a softmax policy with (one-hidden layer) neural network approximation in a mean-field regime. Additional entropic regularization in the associated mean-field probability measure is added, and the corresponding gradient flow is studied in the 2-Wasserstein metric. We show that the objective function is increasing along the gradient flow. Further, we prove that if the regularization in terms of the mean-field measure is sufficient, the gradient flow converges exponentially fast to the unique stationary solution, which is the unique maximizer of the regularized MDP objective. Lastly, we study the sensitivity of the value function along the gradient flow with respect to regularization parameters and the initial condition. Our results rely on the careful analysis of non-linear Fokker--Planck--Kolmogorov equation and extend the pioneering work of Mei et al. 2020 and Agarwal et al. 2020, which quantify the global convergence rate of policy gradient for entropy-regularized MDPs in the tabular setting.
    Optimal variance-reduced stochastic approximation in Banach spaces. (arXiv:2201.08518v1 [math.ST])
    (0 min) We study the problem of estimating the fixed point of a contractive operator defined on a separable Banach space. Focusing on a stochastic query model that provides noisy evaluations of the operator, we analyze a variance-reduced stochastic approximation scheme, and establish non-asymptotic bounds for both the operator defect and the estimation error, measured in an arbitrary semi-norm. In contrast to worst-case guarantees, our bounds are instance-dependent, and achieve the local asymptotic minimax risk non-asymptotically. For linear operators, contractivity can be relaxed to multi-step contractivity, so that the theory can be applied to problems like average reward policy evaluation problem in reinforcement learning. We illustrate the theory via applications to stochastic shortest path problems, two-player zero-sum Markov games, as well as policy evaluation and $Q$-learning for tabular Markov decision processes.
    Deep Learning-Accelerated 3D Carbon Storage Reservoir Pressure Forecasting Based on Data Assimilation Using Surface Displacement from InSAR. (arXiv:2201.08543v1 [stat.ML])
    (0 min) Fast forecasting of reservoir pressure distribution in geologic carbon storage (GCS) by assimilating monitoring data is a challenging problem. Due to high drilling cost, GCS projects usually have spatially sparse measurements from wells, leading to high uncertainties in reservoir pressure prediction. To address this challenge, we propose to use low-cost Interferometric Synthetic-Aperture Radar (InSAR) data as monitoring data to infer reservoir pressure build up. We develop a deep learning-accelerated workflow to assimilate surface displacement maps interpreted from InSAR and to forecast dynamic reservoir pressure. Employing an Ensemble Smoother Multiple Data Assimilation (ES-MDA) framework, the workflow updates three-dimensional (3D) geologic properties and predicts reservoir pressure with quantified uncertainties. We use a synthetic commercial-scale GCS model with bimodally distributed permeability and porosity to demonstrate the efficacy of the workflow. A two-step CNN-PCA approach is employed to parameterize the bimodal fields. The computational efficiency of the workflow is boosted by two residual U-Net based surrogate models for surface displacement and reservoir pressure predictions, respectively. The workflow can complete data assimilation and reservoir pressure forecasting in half an hour on a personal computer.
    Spatiotemporal Analysis Using Riemannian Composition of Diffusion Operators. (arXiv:2201.08530v1 [stat.ML])
    (0 min) Multivariate time-series have become abundant in recent years, as many data-acquisition systems record information through multiple sensors simultaneously. In this paper, we assume the variables pertain to some geometry and present an operator-based approach for spatiotemporal analysis. Our approach combines three components that are often considered separately: (i) manifold learning for building operators representing the geometry of the variables, (ii) Riemannian geometry of symmetric positive-definite matrices for multiscale composition of operators corresponding to different time samples, and (iii) spectral analysis of the composite operators for extracting different dynamic modes. We propose a method that is analogous to the classical wavelet analysis, which we term Riemannian multi-resolution analysis (RMRA). We provide some theoretical results on the spectral analysis of the composite operators, and we demonstrate the proposed method on simulations and on real data.
    Non-negative matrix factorization algorithms greatly improve topic model fits. (arXiv:2105.13440v2 [stat.ML] UPDATED)
    (0 min) We report on the potential for using algorithms for non-negative matrix factorization (NMF) to improve parameter estimation in topic models. While several papers have studied connections between NMF and topic models, none have suggested leveraging these connections to develop new algorithms for fitting topic models. NMF avoids the "sum-to-one" constraints on the topic model parameters, resulting in an optimization problem with simpler structure and more efficient computations. Building on recent advances in optimization algorithms for NMF, we show that first solving the NMF problem then recovering the topic model fit can produce remarkably better fits, and in less time, than standard algorithms for topic models. While we focus primarily on maximum likelihood estimation, we show that this approach also has the potential to improve variational inference for topic models. Our methods are implemented in the R package fastTopics.
    Instance-Dependent Confidence and Early Stopping for Reinforcement Learning. (arXiv:2201.08536v1 [stat.ML])
    (0 min) Various algorithms for reinforcement learning (RL) exhibit dramatic variation in their convergence rates as a function of problem structure. Such problem-dependent behavior is not captured by worst-case analyses and has accordingly inspired a growing effort in obtaining instance-dependent guarantees and deriving instance-optimal algorithms for RL problems. This research has been carried out, however, primarily within the confines of theory, providing guarantees that explain \textit{ex post} the performance differences observed. A natural next step is to convert these theoretical guarantees into guidelines that are useful in practice. We address the problem of obtaining sharp instance-dependent confidence regions for the policy evaluation problem and the optimal value estimation problem of an MDP, given access to an instance-optimal algorithm. As a consequence, we propose a data-dependent stopping rule for instance-optimal algorithms. The proposed stopping rule adapts to the instance-specific difficulty of the problem and allows for early termination for problems with favorable structure.
  • Data Science Central

    Is the GPU the new CPU?
    (6 min) I bought my daughter a new computer recently for school. We took the laptop home, then watched, increasingly horrified as out of the box the computer lagged so bad that what should have taken a few minutes for setup took nearly an hour. Back to the store we went, and this time she pointed out… Read More »Is the GPU the new CPU? The post Is the GPU the new CPU? appeared first on Data Science Central.
    Tips for Learning Hadoop in a Month of Lunches
    (4 min) Learning new technologies is a never-ending process. Even the most experienced developer will be familiar with this experience of trying to keep up with the ever-changing technology landscape and it’s shifting trends, buzzwords, and languages. Big Data doesn’t seem to be an exception to this rule. There seems to be a steady stream of new… Read More »Tips for Learning Hadoop in a Month of Lunches The post Tips for Learning Hadoop in a Month of Lunches appeared first on Data Science Central.
    Banking Data | The Best Way to Understand Carbon Footprint
    (4 min) Simply put, a carbon footprint is the number of greenhouse gases that are generated by our everyday actions. The larger our footprint, the more effect it has on global warming. It is virtually impossible to be carbon-neutral. However, we can attempt to reduce it. By doing so, we will contribute to the common goal of saving our… Read More »Banking Data | The Best Way to Understand Carbon Footprint The post Banking Data | The Best Way to Understand Carbon Footprint appeared first on Data Science Central.
    Wearable Technology Demand Continues to Grow
    (3 min) The constantly rising demand for consumer wearable electronic devices for various purposes such as health and heart rate monitoring, pedometers, etc. is bolstering the meteoric rise of the global wearable technology and IOT wearable device market. The solutions and equipment offered by the players and manufacturers operating within the global wearable technology and IOT wearable… Read More »Wearable Technology Demand Continues to Grow The post Wearable Technology Demand Continues to Grow appeared first on Data Science Central.
    Notes to New and Returning DSC Writers
    (2 min) A few weeks ago, we flipped the switch here at Tech Target, retiring our existing Ning-based content management system and debuting our new WordPress environment. Overall we are happy with the way things turned out, but as with any such transition, we’re still working out the kinks. This article covers how to log on for… Read More »Notes to New and Returning DSC Writers The post Notes to New and Returning DSC Writers appeared first on Data Science Central.
  • The Official NVIDIA Blog

    Animator Lets 3D Characters Get Their Groove on With NVIDIA Omniverse and Reallusion
    (3 min) Benjamin Sokomba Dazhi, aka Benny Dee, has learned the ins and outs of the entertainment industry from many angles — first as a rapper, then as a music video director and now as a full-time animator. The post Animator Lets 3D Characters Get Their Groove on With NVIDIA Omniverse and Reallusion appeared first on The Official NVIDIA Blog.
    Vulkan Fan? Six Reasons to Run It on NVIDIA
    (3 min) Many different platforms, same great performance. That’s why Vulkan is a very big deal. With the release Tuesday of Vulkan 1.3, NVIDIA continues its unparalleled record of day one driver support for this cross-platform GPU application programming interface for 3D graphics and computing. Vulkan has been created by experts from across the industry working together Read article > The post Vulkan Fan? Six Reasons to Run It on NVIDIA appeared first on The Official NVIDIA Blog.
  • Becoming Human: Artificial Intelligence Magazine - Medium

    10 AI Solutions for Next-Level Customer Experience
    (5 min) Gone are the days when customer service agents have to deal with tedious customer relationship tasks. With a lot of customer service…
    Impact of Artificial Intelligence on Travel Industry and its Future
    (3 min) Artificial intelligence as a concept has existed for decades though in the current era it is becoming advanced and more reliable enough to…
  • OpenAI

    Introducing Text and Code Embeddings in the OpenAI API
    (4 min) We are introducing embeddings, a new endpoint in the OpenAI API that makes it easy to perform natural language and code tasks like semantic search, clustering, topic modeling, and classification. Embeddings are numerical representations of concepts converted to number sequences, which make it easy for computers to understand the relationships
  • John D. Cook

    Permutations at the command line
    (4 min) Yesterday I wrote about how you could use the tr utility to find the dual of a modal logic expression. Today I’ll show how you could use tr to work with permutations. This post is a hack, and that’s the fun of it. In the logic post we were using tr to swap characters. More […] Permutations at the command line first appeared on John D. Cook.
  • Artificial Intelligence

    I added text prompts from Bible to VQ-GAN and voiceover to create an interesting video taking us through Genesis Chapter I.
    (0 min) submitted by /u/deepank [link] [comments]
    Researchers Build AI That Builds AI - Given a new, untrained deep neural network designed for some task, the hypernetwork predicts the parameters for the new network in fractions of a second, and in theory could make training unnecessary.
    (1 min) submitted by /u/Straight_Finding_756 [link] [comments]
    Machine Learning "Labs" from the undergraduate AI Club at University of Florida (I'm the lecturer)
    (1 min) Just finished Week 3 of Machine Learning Mondays that are "Labs" (this week was more of a lecture) for The GAItor Club at University of Florida. Thought I would post this here just in case anyone is interested learning about Data Science or Machine Learning. I put a lot of time into the lecture this week, so I hope you all like it. Suggestions for improvement are encouraged! :) Live every Monday @ 6:30 Here is the link: https://www.youtube.com/watch?v=1iV_JPqoK48 submitted by /u/kylethe1st [link] [comments]
    [R] New Study Revisits Laplace Approximation, Validating It as an 'Effortless' Method for Bayesian Deep Learning
    (1 min) In the new paper Laplace Redux — Effortless Bayesian Deep Learning, a research team from the University of Cambridge, University of Tübingen, ETH Zurich and DeepMind conducts extensive experiments demonstrating that the Laplace approximation (LA) is a simple and cost-efficient yet competitive approximation method for inference in Bayesian deep learning. Here is a quick read: New Study Revisits Laplace Approximation, Validating It as an 'Effortless' Method for Bayesian Deep Learning. The Laplace code is available on the project’s GitHub. The paper Laplace Redux — Effortless Bayesian Deep Learning is on arXiv. submitted by /u/Yuqing7 [link] [comments]
    Coursera Plus Subscription $100 OFF until 31st of Jan
    (0 min) submitted by /u/biggbrother23 [link] [comments]
    [Presentation] A Model for Language Acquisition
    (0 min) submitted by /u/HyugenAI [link] [comments]
    What field of AI are you specialized in and why?
    (2 min) I am an engineering graduate who has recently gotten hooked on Machine Learning and Computer vision. But I see the field of AI is much much more than that. How did you get in the field you are today, and what has driven you to specialize in that field/sub field ? submitted by /u/Aris_davinci [link] [comments]
    Can I use Googledeepmind to play chess for me?
    (0 min) https://www.youtube.com/watch?v=0g9SlVdv1PY&t=527s submitted by /u/oOMr_StupidOo [link] [comments]
    AlphaFold Artificial Intelligence Powered Drug Discovery of a Novel CDK20 Inhibitor
    (0 min) submitted by /u/Straight_Finding_756 [link] [comments]
    In What ways AI can help Marketing
    (1 min) submitted by /u/futureanalytica [link] [comments]
    Artificial Nightmares : Fiddlesticks the Scarecrow || Pytti VQGAN AI Art Video [4K 60 FPS]
    (1 min) submitted by /u/Thenamessd [link] [comments]
  • Machine Learning

    [R] Transformers in Medical Imaging: A Survey
    (0 min) - Excited to share our latest survey paper on the applications of Transformer models in Medical Imaging by covering more than 125 papers and a diverse set of applications including segmentation, detection, classification, registration, reconstruction, and clinical report generation. - For paper and related github repository please check https://github.com/fahadshamshad/awesome-transformers-in-medical-imaging https://preview.redd.it/h319d84wkwd81.png?width=2970&format=png&auto=webp&s=9e29d359f2ba210507d5902fa08c57cea4114c56 submitted by /u/fahadshamshad3 [link] [comments]
    [D] Want to build an ASR model that trains in the Urdu language and detects Urdu labels. Later translates it into English and create a transcript file.
    (2 min) ​ Hello, I am working on a speech recognition project which gets Urdu speech and translates it into English making a transcripted file of the audio (In English). I know there are tons of libraries that would make my work easy but I want to build it from scratch. I have datasets that I could train my model on it but I am here confused about what steps should I take if I go start building a model from scratch? There is another option if I can tweak changes on the already made model and train my dataset on it accepting Urdu language and identifying labels. Is there any general ASR model that I could work on it? If some of you have built a Speech to text system in your own local language how you have done it? Any research, article, book, code that I can get some help with would be useful. I got the phonemes set and Urdu audio set from Urdu Corpus Phonemes Set but I don't have any info on which model should I train my dataset on it and get the result that I want? If you guys need to ask for more information or something I forgot to tell you please suggest me in the comments. Thank you submitted by /u/username_xyz123 [link] [comments]
    [R] ConvNeXt paper explained with results
    (0 min) Here is a youtube video explaining the latest hot paper, "A ConvNet for the 2020s" Hope its useful: https://youtu.be/OpfxPj2AIo4 submitted by /u/Combination-Fun [link] [comments]
    [N] Easily Build Machine Learning Products
    (2 min) Hey, I’m Merve from Hugging Face, an Open-Source company working in the democratization of responsible Machine Learning. 👋 I used to be an MLE struggling to find my way around which model I should train for the use case I was asked for, and I know there are so many people like me. We've recently launched an Open-Source initiative is called “Tasks”, in which we've curated datasets and benchmarks used for the task, metrics to evaluate the task, models trained on the task, use cases, tutorials and more for every task as well as interactive demos to try out directly in the browser. 🙌🏼 You can get started with learning here. (You can also use hf.co/tasks shortcut. 🤓) I've also made this chatbot that tells you which type of task fits your use case the best. We appreciate feedbacks 🤗 Let us know what you think! submitted by /u/unofficialmerve [link] [comments]
    [R] Sinkformers: Transformers with Doubly Stochastic Attention
    (0 min) submitted by /u/hardmaru [link] [comments]
    [P] Dynamic Time Warping Explained & Implemented in Python
    (0 min) Hi I wrote an about dynamic time warping (DTW) which is a process to compare the similarities between two time series. Check it out if you're interested! https://vatsal12-p.medium.com/dynamic-time-warping-explained-fbb24c1e079b submitted by /u/spidermon97 [link] [comments]
    [R] Dimensionality reduction techniques for multivariate time series?
    (3 min) The title is quite a mouthful, but I'm trying to learn about how a multivariate (high dimensional) time series can be reduced to a lower amount of time series to be used in regression. In the same way that an image can be compressed into a smaller latent space using Autoencoders, I am trying to read up on how these N signals can be compressed into n<<N more salient signals. Say you had 100 EMG signals, what techniques are there that can compress these into, say, 5 signals, which can then be used in turn to estimate a joint angle, like an elbow, finger or knee. Of course I am aware that it may not be as straight forward as I explain it here, but I have to start somewhere. Thanks in advance! submitted by /u/jacksodus [link] [comments]
    [P] feedback on new tool
    (1 min) Hello all, Hope you are well. Thanks to AstraZeneca, I was able to open source some of my work for the past 1 year. The project is named, magnus and relates to data science/engineering pipelines. The project is available here: https://github.com/project-magnus/magnus-core And the docs at: https://project-magnus.github.io/magnus-core/ Could you please share some feedback? I am open to contributions and can use your help in making it better/useful. We are releasing magnus-extensions, an extensions package which makes it cloud ready, soon. Cheers, Team magnus submitted by /u/magnus-pipelines [link] [comments]
    [Project] Real Alchemy - generating a random seed with Chinese GanZhi.
    (1 min) Hi, guys. I found the random seeds sometimes affect the final experimental results and we need a little luck, so I build this project (Github address: random-luck) to select a random seed. This is just for fun with no scientific proof or explanation :) This project provides two interface for now: lucky number by your birth year; number by calculating Chinese Tiangan & Dizhi (or Ganzhi in short), which come from the I Ching, an ancient Chinese method for representing time (wikipedia: Sexagenary_cycle). The program will calculate the Ganzhi of your experiment's start time and return a lucky number. You may use this number as the random seed. Again, there is no scientific explanation and totally alchemy. Hope it may bring a little bit of happiness to our research life. submitted by /u/Spico197 [link] [comments]
    AlphaFold Artificial Intelligence Powered Drug Discovery of a Novel CDK20 Inhibitor
    (1 min) submitted by /u/Straight_Finding_756 [link] [comments]
    [D] Which do you use technology for the deployment of your deep learning project?
    (1 min) Which do you use technology for the deployment of your deep learning project that contains more than one model? What do you think about the "Triton Inference Server"? If you used It, How was your experience? submitted by /u/m-pektas [link] [comments]
    [D] Synthesizing 48 kHz Audio
    (1 min) I am working on a problem where I need to generate 48 kHz stereo (2 channels) audio samples of about 5-10 seconds long based on an existing dataset (~800). I have the following issues: The input audio is of varying lengths. I get around this by padding the samples to the length of the largest. For example, if the samples range from 1 - 10 seconds I pad them all with zeroes so they are all 10 seconds long. I have used WaveGAN to generate audio that sounds decent. However, it only outputs 16 kHz mono samples. There are a few parameters to tweak this but they don't, in my experience, converge nicely. It is also difficult to control the output time of the WaveGAN samples. It would be ideal to specify I want to synthesize a 7 second or 9 second audio sample, for instance. This leaves me with a few thoughts: I am currently trying MelGAN and WaveGrad but I don't hold out much hope for them to alleviate the above issues (even if they create slightly better output). Maybe I should be outputting 16 kHz audio and using audio super-resolution to upsample to 48 kHz. Has this been done before in an accessible library? I am wary to build my own model based on WaveGAN because I believe 48 kHz might be too many parameters to learn in a reasonable amount of time or based on my sampled data. If anyone has played in this space before I would love to hear your advice! submitted by /u/oflagelodoesceus [link] [comments]
    [D] Anyone else feel "sad" that even a child now can easily create and train a model because of how easy it has become?
    (6 min) Dont get me wrong, it is amazing how simple making and training a model has become, but I remember in my young days (like 2-3 years ago...) that I struggled so much to actually understand backpropagation and at some point I decided to just code everything from scratch (with numpy tho, I still love myself a little) and I got such a deeper understanding, I feel like nowadays you just model.fit() or model.train_on_batch().... I wanna work in ML, but I have this feel machines are gonna take my job since its so simple, which I dont have yet.... To be frank, I dont feel "sad" about how simple ML's library as become, I feel sad about most things. Like I coulda use model.fit() "back in my days" and pretend to understand why we use derivatives during backprop. I just get so damn sober looking and using the model.fit() function that has literal decades worth of intense engineers brain juice poured into that simple 3 letter function, like damn, will it someday be the last function call a human do? EDIT: I am no ML engineer because I can use kaggle well for my "hubby" projects, and thus should not have posted something where I pretend to know stuff about ML. thank you for making that clear ML community of reddit submitted by /u/depressedPOS-plzhelp [link] [comments]
    [Discussion] Pre-trained style transfer with sub-1-second processing?
    (1 min) I'm looking at STROTSS, WCT and ADAIN for low-latency style transfer in a cloud platform. Are these or any other solutions suitable to achieve sub-1-second throughput on low-resolution images (256x256)? submitted by /u/dataden [link] [comments]
    [D] How would you define a "model" of how the information is being processed by a deep net?
    (1 min) Let's say I have: - An input - One or more NNs that process the input, not necessarily in a feedforward way (and perhaps in the context of RL even doing attention on the input while taking queries from the policy). - An output Now, if I asked you to come up with a model that is both descriptive and predictive of how the aforementioned mechanisms work, what would you do? The answer in my head is not clear. It goes from solutions like "a neural network that learns to predict the output given the input" to more ideas like "a dictionary with features that are saying something about those mechanisms". TL;DR how do you create a good descriptive and predictive model of how information is processed by a nonlinear function? submitted by /u/No_Possibility_7588 [link] [comments]
  • Neural Networks, Deep Learning and Machine Learning

    One class classification
    (1 min) I have pretty tough task to make one class neural network classifier. The problem is that the images of different classes are looking extremely similar, and the anomaly is not visible to human eye, so I understand why my 1C-NN doesn’t work(interesting that 2class classifiers perform good). But I need some solution for one-class classification. Maybe somebody can help with any ideas? submitted by /u/dazira [link] [comments]
    Racism detector
    (1 min) How easy is it to make one using Neural Networks? Obviously a string like "nigg" should lead to an instaban. But, sometimes racism can be subtle. So-called "dog whistles". Viable? As an aside, do you guys think Tay (?) the Microsoft Twitter AI was an actual "AI" (even then what kind?) or a PR stunt with a crew of "trolls" running the page? submitted by /u/aefghag [link] [comments]
  • Reinforcement Learning

    "Researchers Build AI That Builds AI: By using hypernetworks, researchers can now preemptively fine-tune artificial neural networks, saving some of the time and expense of training"
    (1 min) submitted by /u/gwern [link] [comments]
    Is TD3 Unstable?
    (1 min) TD3 sure is an improvement over DDPG and it also learns faster from my own testing. However, in training TD3 on Bipedal-Walker (from OpenAI Gym), I've found that on some trials, the agent's performance just plummets and it pretty much has to relearn all over again. Here's an example trial below: After every training update, a test episode is conducted and the cumulative reward of that episode is graphed. I haven't been able to find anything online regarding "instability" of TD3, so it may be an issue with my implementation. Anyone have any insight? Here's my source code: https://github.com/hmomin/TD3-Bipedal-Walker submitted by /u/hmomin [link] [comments]
    Alternatives to Unity3D for simulating 3D environments with realistic physics for robotics and training a reinforcement learning model?
    (1 min) Hi, Thanks to this community, I discovered that Unity3D provided a framework for robotics that enables to train reinforcement learning in 3D environments with realistic visuals and physics. https://unity.com/solutions/automotive-transportation-manufacturing/robotics It seems to fit pretty well my need for my project. Robotics and physics are needed, as well as realistic rendering, for computer vision models. I wanted to know if there are other similar solutions that I shall explore. So far I found PyBullet, RobotPy, RobotDK, SOFA, and some others, but I wonder if there is something that is comparable or better than Unity 3D for this specific use case. Thanks submitted by /u/x11ry0 [link] [comments]
    [D]agent always taking sub-optimal action with high -ve reward in actor-critic
    (2 min) hi! My question is related slightly to a similar post i made recently (https://www.reddit.com/r/reinforcementlearning/comments/s7ptys/d_ddpg_not_converging_were_actor_critic_and_dqn/hted26u/?context=3) . I am training an actor-critic using a keras implementation which I have tested and works well on cartpole (although a simple environment, I dont think there are specific bugs in the implementation) - my task is a graph optimisation in which the agent makes or removes connections on an adjacency matrix (encoded as 0 or 1 for no connection or connection respectively) - it is a multi-step environment so for every step (as per my linked post) the observation is accompanied with a one-hot vector to indicate the selection the agent made - as follows: select node, return one hot vector of node1…
    Custom observation space in environment
    (1 min) Hey everyone, I've started working on the implementation of a custom multiagent environment in Petting Zoo. However, I've gotten a bit stuck. My observation space can be modeled best as a graph. Hence, I would rather not use something like the "Box" for this, but create my own graph-based observation space. The algorithm I'll use will then be equipped with the necessary techniques to deal with this type of input. My first idea is to represent the graph as an adjacency matrix, and just concatenate the vertex/edge features to the space (then I could use a combination of Discrete/Box). Is this the recommended thing to do, or is there a better way of going about it? One of the nice things about Petting Zoo is the simplicity by which you can train pre-implemented agents. I think this should still be possible by adding the necessary observation preprocessing and architecture as a part of the neural networks. However, I don't have much experience with this, so if anyone could give their opinions on this approach I'd be very grateful. Thank you in advance for your reply! submitted by /u/justLars7D1 [link] [comments]
    Huge Step in Legged Robotics from ETH ("Learning robust perceptive locomotion for quadrupedal robots in the wild", Miki et al 2022)
    (1 min) submitted by /u/gwern [link] [comments]
    I can hardly understand that SARSA follows the Bellman Expectation Equation
    (1 min) TSIA I can understand if "derived from Q" (in the pseudo code) means sampling, however, it means argmax. where can I get clues for 'expectation'? https://preview.redd.it/35jhxvfcjqd81.png?width=350&format=png&auto=webp&s=b9ea07697523fd1685da3686527b19611deca726 submitted by /u/ad26kr [link] [comments]
    How are entity embeddings created for RL
    (1 min) Hello, In some works, like this one, entity embeddings are used to represent certain objects in a RL task (agent, obstacles, etc). How are these embeddings created? Are they trained as part of the RL training process, or could be pre-trained? submitted by /u/AhmedNizam_ [link] [comments]
    How would you define a "model" of how the information is being processed by a deep net?
    (1 min) Let's say I have: - An input - One or more NNs that process the input, not necessarily in a feedforward way (and perhaps in the context of RL even doing attention on the input while taking queries from the policy). - An output Now, if I asked you to come up with a model that is both descriptive and predictive of how the aforementioned mechanisms work, what would you do? The answer in my head is not clear. It goes from solutions like "a neural network that learns to predict the output given the input" to more ideas like "a dictionary with features that are saying something about those mechanisms". TL;DR how do you create a good descriptive and predictive model of how information is processed by a nonlinear function? submitted by /u/No_Possibility_7588 [link] [comments]

2022-01-24

  • Google AI Blog

    Accurate Alpha Matting for Portrait Mode Selfies on Pixel 6
    (8 min) Posted by Sergio Orts Escolano and Jana Ehman, Software Engineers, Google Research Image matting is the process of extracting a precise alpha matte that separates foreground and background objects in an image. This technique has been traditionally used in the filmmaking and photography industry for image and video editing purposes, e.g., background replacement, synthetic bokeh and other visual effects. Image matting assumes that an image is a composite of foreground and background images, and hence, the intensity of each pixel is a linear combination of the foreground and the background. In the case of traditional image segmentation, the image is segmented in a binary manner, in which a pixel either belongs to the foreground or background. This type of segmentation, however, is unab…
    Separating Birdsong in the Wild for Classification
    (7 min) Posted by Tom Denton, Software Engineer and Scott Wisdom, Research Scientist, Google Research Birds are all around us, and just by listening, we can learn many things about our environment. Ecologists use birds to understand food systems and forest health — for example, if there are more woodpeckers in a forest, that means there’s a lot of dead wood. Because birds communicate and mark territory with songs and calls, it’s most efficient to identify them by ear. In fact, experts may identify up to 10x as many birds by ear as by sight. In recent years, autonomous recording units (ARUs) have made it easy to capture thousands of hours of audio in forests that could be used to better understand ecosystems and identify critical habitat. However, manually reviewing the audio data is very time co…
  • Data Science Central

    What Are Lagrange Points?
    (3 min) The James Webb Telescope, the long-awaited successor to the Hubble Telescope, was launched on Christmas Day, December 25th, 2021 and has, over the course of several weeks, been heading out to its final destination, an odd point in space called L2. While Lagrange points are well known to astronomers, they, and why the are important,… Read More »What Are Lagrange Points? The post What Are Lagrange Points? appeared first on Data Science Central.
    Data Literacy Education Framework – Part 1
    (6 min) What do we need to do to increase the data literacy of our organization? In a world where your personal data, and the preferences and biases buried in that data, are being used to influence your behaviors, beliefs, and decisions, data literacy becomes an indispensable fundamental skill.  And it’s not just corporations that need this… Read More »Data Literacy Education Framework – Part 1 The post Data Literacy Education Framework – Part 1 appeared first on Data Science Central.
    How Cyber Security Consulting Can Prevent Attacks in BFSI’s Digital Era
    (3 min) Organizations in the cyber security counseling market have been profiting by revenue openings during the COVID-19 episode, as the pandemic changed the cyber-threat scene, prompting a blast of COVID-related assaults. Notwithstanding, the financial circumstance has been pressed for some undertakings and associations, which is causing faltering for interest in counseling administrations. By the by, market… Read More »How Cyber Security Consulting Can Prevent Attacks in BFSI’s Digital Era The post How Cyber Security Consulting Can Prevent Attacks in BFSI’s Digital Era appeared first on Data Science Central.
    Top 10 Most In-Demand Skills in the World of Data Science
    (4 min) Hindsight is always 20/20, but someone needs to be looking into the future before it gets here. That’s the role of a data scientist, and in order to do that, they need a ton of skills at their disposal. Here are 10 in-demand skills in the world of data science that will help you get… Read More »Top 10 Most In-Demand Skills in the World of Data Science The post Top 10 Most In-Demand Skills in the World of Data Science appeared first on Data Science Central.
    Can Digital Twins Prevent a Pandemic?
    (4 min) Digital Twins may be the future of healthcare. Remote working and early warning systems can tackle pandemics. Deploying Medical Cyber-Physical Systems faces many challenges. Digital twins—virtual replicas of assets or systems—have traditionally been used for engineering and construction [1]. However, there’s a new use case on the horizon: combatting the current covid-19 pandemic crisis as… Read More »Can Digital Twins Prevent a Pandemic? The post Can Digital Twins Prevent a Pandemic? appeared first on Data Science Central.
  • The Official NVIDIA Blog

    Meta Works with NVIDIA to Build Massive AI Research Supercomputer
    (3 min) Meta Platforms gave a big thumbs up to NVIDIA, choosing our technologies for what it believes will be its most powerful research system to date. The AI Research SuperCluster (RSC), announced today, is already training new models to advance AI. Once fully deployed, Meta’s RSC is expected to be the largest customer installation of NVIDIA Read article > The post Meta Works with NVIDIA to Build Massive AI Research Supercomputer appeared first on The Official NVIDIA Blog.
    How the Intelligent Supply Chain Broke and AI Is Fixing It
    (4 min) Let’s face it, the global supply chain may not be the most scintillating subject matter. Yet in homes and businesses around the world, it’s quickly become the topic du jour: empty shelves; record price increases; clogged ports and sick truckers leading to disruptions near and far. The business of organizing resources to supply a product Read article > The post How the Intelligent Supply Chain Broke and AI Is Fixing It appeared first on The Official NVIDIA Blog.
  • MIT News - Artificial intelligence

    3 Questions: Anuradha Annaswamy on building smart infrastructures
    (5 min) Senior research scientist and her team are designing intelligent systems that could someday transform the way we travel and consume energy.
  • John D. Cook

    Look up HTML entity or Unicode
    (1 min) Fairly often I want to find out whether there is an HTML entity for a given Unicode character, or given an HTML entity I want to look up its Unicode value. For example, ∃ has Unicode value U+2203. I might want to look up whether there is an HTML entity for this. There is, and […] Look up HTML entity or Unicode first appeared on John D. Cook.
    Complex analysis calendar
    (1 min) My recent post Reading plots of a complex function was based on ideas from a book by Elias Wegert. Thanks to a comment on that post I discovered that Wegert and his colleagues have put out a 2022 calendar with images created from plots of complex functions. This year’s calendar, and previous calendars going back […] Complex analysis calendar first appeared on John D. Cook.
    Dual axioms in modal logic
    (3 min) Axioms in modal logic often say that one sequence of boxes and diamonds in front of a proposition p implies another sequence of boxes and diamonds in front of p. For example, Axiom 4 says □ p → □□ p and Axiom 5 says ◇p → □◇p. Every axiom has a dual form. The dual […] Dual axioms in modal logic first appeared on John D. Cook.
    Word problems, logic, and regular expressions
    (2 min) Word problems Suppose you have a sequence of symbols and a set of rewriting rules for replacing some patterns of symbols with others. Now you’re given two such sequences. Can you tell whether there’s a way to turn one of them into the other? This is known as the word problem, and in general it’s […] Word problems, logic, and regular expressions first appeared on John D. Cook.
  • Artificial Intelligence

    Understanding No-Code AI And Top No-Code Machine Learning (ML) Tools For Some DIY ML Projects
    (0 min) submitted by /u/techsucker [link] [comments]
    What is Meta's New AI Supercomputer?
    (0 min) submitted by /u/Beautiful-Credit-868 [link] [comments]
    Ai undergrad class, worthwhile?
    (1 min) I’m currently a sophomore cs major, my school offers an undergrad ai class, taught by one of our ai researcher professors, At this point I’d say I’m more of a passive observer in the ai seen (more familiar than a layman, but not familiar with too much technical knowledge) do you guys think this class would be worth taking at all? Im not so convinced that anything deep could be taught in 1 semester, but I’m very interested in ai so I’m curious to hear what you think submitted by /u/Pure_Appointment5355 [link] [comments]
    Last Week in AI - Self-driving trucks in Texas, AI fighter pilots, robots to help in nursing staff shortage, and more!
    (1 min) submitted by /u/regalalgorithm [link] [comments]
    Following a successful run in Malta this last November, our AIBC Pitch Competition is leading forward to Dubai!
    (1 min) submitted by /u/thedyezwfl [link] [comments]
    Meta's Massive New AI Supercomputer Will Be 'World's Fastest'
    (0 min) submitted by /u/oliverpeckham [link] [comments]
    European and UK Deepfake Regulation Proposals Are Surprisingly Limited
    (0 min) submitted by /u/DaveBowman1975 [link] [comments]
    An AI system that thinks fast and slow
    (0 min) submitted by /u/bendee983 [link] [comments]
    N-queens in python
    (0 min) Hello! I'm beginner in AI studies. Does anyone have a python code solving the N-queens problem using the iterative deepening search algorithm (IDS) ? submitted by /u/gwuyn [link] [comments]
    What is the main benefit of using AI-powered solutions in manufacturing?
    (2 min) View Poll submitted by /u/futureanalytica [link] [comments]
    I’m trying to add conversational AI into my in-house Drive Thru kiosk app, has anyone developed anything similar?
    (1 min) Hello everyone, This is my current Drive Thru Kiosk setup. I'm upgrading to new Kiosks in Q2, but also want to add conversational AI. I'm trying to build a conversational AI for my restaurants Drive Thru Kiosks that I have been working on for the last year. One of the biggest complaints from customers is that they don't want to touch the kiosk and want traditional Drive Thru Ordering experience. I was wondering if anyone has built a similar solution. I was taking a look at the following video and it's similar to what I want to build: https://www.youtube.com/watch?v=KQQEcpwowcM&ab_channel=WomenTechmakers Has anyone else developed something similar? submitted by /u/Tastiollc [link] [comments]
  • Machine Learning

    [D] CVPR Reviews are out
    (1 min) This particular submission of mine, has gone through 3 submission cycles before CVPR. We pulled out 2 times, didn't submit, and the third time it got rejected by a small margin at ICCV (we had 1 strong and 1 weak accept back then, with 2 weak rejects) I spent the next 6 months, upgrading the method and results, addressing every point ICCV reviewers raised. Submitted to CVPR. Got 2 WA and a WR. Somehow without a Strong Accept it seems I got worse ratings than I did at ICCV for a much stronger paper. Surprisingly Reviewer 2 this time has a WA. The WR has raised some valid points, I don't think I can revert their decision easily. What do yours look like? submitted by /u/nefrpitou [link] [comments]
    [R] LaMDa paper released, 137B params, 1.56T words - specialized for dialog
    (1 min) Blog: https://ai.googleblog.com/2022/01/lamda-towards-safe-grounded-and-high.html Paper: https://arxiv.org/abs/2201.08239 From my skim, impressive results IMO perhaps due to their human-sounding answers. But the main focus seems to be bias and safety.... submitted by /u/Competitive-Rub-1958 [link] [comments]
    OpenAI Residency 2022 [R]
    (1 min) Hi all, I am starting this thread for everyone who has applied to the OpenAI AI Residency in 2022. The applications have closed on the 14th of Jan 2022. Has anyone heard back from OpenAI? Thank you for this and wishing everyone the best of luck with their applications. :) submitted by /u/macro-2018 [link] [comments]
    [R] Huge Step Forward in Legged Robotics from ETH
    (1 min) https://www.youtube.com/watch?v=zXbb6KQ0xV8 Control policies learned via RL are starting to work in the real world! Typically policies learned via simulation tend to transfer poorly to the real world (the so-called sim2real gap), so I'm curious to dig into this work to see how they overcame this limitation. From just watching the video and guessing, it would make sense if noising the belief state (rnn(h,concat(proprio,extero)) + \eps ~ Noise) and learning to condition proprioceptive attention on the belief uncertainty is enough. Very cool work and so exciting to see robotics groups exploiting ML more and more (gate attention + learned belief states here). submitted by /u/AristocraticOctopus [link] [comments]
    [D] AAAI/IJCAI for ML and related papers
    (4 min) Now this could be a controversial post, but I am looking for some discussions. AAAI and IJCAI are often considered as "not as good" places for ML and related papers. For example, not as good as ICML et al for pure ML papers, not as good as ACL et al for NLP papers, not as good as CVPR et al for CV papers. Even robotics people mostly favor ICRA/IROS/RSS and the new CoRL, even though the first two conferences typically have much higher acceptance rate (if that indicates anything) than AAAI/IJCAI (~40% vs ~20%). RSS and CoRL tend to have ~30% acceptance rate. Lets call the "competitor conferences" as "specialized venues" for now. So, what's your experience reading, reviewing, and publishing papers at these venues? In reading, do you tend to find AAAI/IJCAI papers to be less good than th…
    Who creates the ML Models? Data Scientists or ML Engineers? [D]
    (2 min) Who creates the preliminary models like xgboost, random forest, etc? I used to think ML engineers did that but was told by an ML engineer that the Data scientists at his company create the models & then he gets them ready for production. So is this typically the case or does it just depend on the company? submitted by /u/Careless_Drummer_208 [link] [comments]
    [R] Finding Errors in Perception Data With Learned Observation Assertions
    (1 min) ML models are increasingly being deployed in mission-critical settings, such as autonomous vehicles, but shockingly the data used to train these models are rarely checked! For example, the Lyft Level 5 dataset has errors in 70% of the validation scenes (see the images). Bad data can lead to bad models! Related work shows that bad data can effectively reduce model capacity by 3x (Pervasive Label Errors in Test Sets Destabilize Machine Learning Benchmarks)! See our blog post for more details: https://dawn.cs.stanford.edu/2022/01/24/loa/ We developed LOA to find such errors in perception data (accepted to SIGMOD 2022). We deployed LOA over the Lyft Level 5 perception dataset and successfully found errors in every validation scene with an error. We've open-sourced our code for general use (https://github.com/stanford-futuredata/loa) and have more details in our blog post (https://dawn.cs.stanford.edu/2022/01/24/loa/) and full paper (https://arxiv.org/abs/2103.14749) https://preview.redd.it/qfbfp4ymeod81.png?width=400&format=png&auto=webp&s=c517205c021fc6f07c4fe4c27ee5ad6bd1b0a31f https://preview.redd.it/ybayh5ymeod81.png?width=400&format=png&auto=webp&s=c7875930f5d49516b07e442a1bb29abc053fbd5d submitted by /u/danieldkang [link] [comments]
    [P] Awesome List - Objective Function Optimizer
    (0 min) A curated list of methods for optimizing an objective function. This collection aims to put together references to various optimization methods. https://github.com/zoq/Awesome-Optimizer PR's are more than welcome! submitted by /u/eusben [link] [comments]
    [D] Looking for a dataset with images of same scenery with different resolutions
    (1 min) Is there a data set that has groups of images where each group shows a scenery with different resolutions (up to even 20MP)? I want to experiment with how some semantic segmentation networks work on images with different resolutions and thus check what would be the best way to process them before putting them into the trained network for inference (cropping or whole image?). I don't want to resize images because that would add artificial information. submitted by /u/jthat92 [link] [comments]
    [D] Does NeurIPS accept surveys?
    (0 min) Has anyone got experience with submitting surveys at Neurips? They don't seem the kind of conference prioritising surveys. submitted by /u/Conscious_Heron_9133 [link] [comments]
    [R] Paper Summary: Masked Autoencoders Are Scalable Vision Learners
    (1 min) I have just published a new article in the medium that is associated with computer vision. It is a considerable attempt to reduce computational costs and consequently time consumption. In this research, random patches of the image were removed to make training simpler with efficiency which makes it scaleable in computer vision. It includes an asymmetric encoder-decoder architecture (Also, I have added Python code for both main subsets of architecture (encoder and decoder)). This pre-training model is really practical in real-world computer vision. Hope you give it a thorough reading and consider it in your own projects. If my articles sound interesting to you, you can follow me on medium. :) https://medium.com/mlearning-ai/paper-summary-masked-autoencoders-are-scalable-vision-learners-2dea8cdb1884 submitted by /u/rezayazdanfar [link] [comments]
    [N] Awesome Metric Learning
    (2 min) The Metric Learning approach to data science problems is heavily underutilized. There are a lot of academic papers around it but much fewer practical guides and tutorials. So we decided that we could help people adopt metric learning by collecting related materials in one place. We are publishing a curated list of awesome practical metric learning tools, libraries, and materials - https://github.com/qdrant/awesome-metric-learning This collection aims to put together references to all required materials for building your application using Metric Learning. PR's are more than welcome! submitted by /u/devzaya [link] [comments]
    [D] Which MLOps platform can you recommend?
    (1 min) Running ML workloads on a large pool of machines can get tedious. The solution seems to be to run them on a cluster instead (e.g. Kubernetes). However, managing jobs on a cluster manually is difficult for a novice. Numerous MLOps platform have been springing up to address this. Here are some I found: Algorithmia clear.ml cnvrg.io Dataiku Datarobot Flyte Iguazio JetBrains Datalore Kubeflow Metaflow Paperspace Gradient seldon.io Valohai Which MLOps solution do you use and can recommend? I'm looking for something that can be hosted on-prem and is preferably inexpensive/free to use. In terms of features, something that will allow experimentation and collaboration by a team. submitted by /u/postrational [link] [comments]
    [D] Can we define the expiry date of a Machine Learning model?
    (2 min) There is a lot of literature on model drift, data drift, model retraining etc. But can we really define an expiry date for an ML model? If so what are the factors it will depend on? submitted by /u/Pranaymodukuru [link] [comments]
    Improving Unsupervised Dialogue Topic Segmentation with Utterance-Pair Coherence Scoring (Summary)[D]
    (1 min) Dialogue topic segmentation is critical in several dialogue modeling problems. However, unlike popular unsupervised approaches that only exploit surface features, authors in this paper train BERT model by automatically generating training data and calculating pretty accurate utterance-pair coherence scoring system. Video Summary: https://youtu.be/i8WUJZk5_I8 Paper: https://arxiv.org/pdf/2106.06719.pdf submitted by /u/prakhar21 [link] [comments]
    Has there been any work into the effectiveness of teaching ‘rotational extrapolation’ as a mechanism for boosting the effectiveness of image recognition?[D]
    (2 min) Intuitively, one reason it’s easy for people to recognize objects with relatively few samples, is that we have some notion of how to extrapolate the appearance of objects under some non-deformative manipulations. A child can imagine what an object might look like were it to be rotated without having seen it before. Has there been any work at teaching NN how to ‘imagine’ the appearance of objects subject to these kinds manipulations, as to boost there general effectiveness at image recognition? submitted by /u/Greenface1998 [link] [comments]
    [D] Is a master's degree in machine learning/AI useful?
    (1 min) I think we all know that to dedicate yourself to machine learning you don't need a master's degree, experience and a university degree is more than enough, but my question is focused on something else: Question 1: In countries like Canada or the United States and having a master's degree, can it influence the salary you receive? Question 2: Does it influence in any way that you are hired by better companies? Question 3: Is a master's degree abroad validated in the United States or Canada worth the same as one of those countries for companies from those countries? Question 4: Based on the answers above, would you recommend doing one? submitted by /u/Garluz [link] [comments]
  • Neural Networks, Deep Learning and Machine Learning

    An Idea
    (1 min) I am not sure whether it is possible or not, But is it by any chance possible to create a neural network that points out design flaws in websites or in magazines? If yes, then what is an approach that I can try? Thanks. submitted by /u/stealth_master1 [link] [comments]
    Underfitting vs. Overfitting and ways to overcome them | Kolbenkraft
    (0 min) submitted by /u/cjmodi306 [link] [comments]
    What is grokking in machine learning?
    (0 min) submitted by /u/analyticsindiam [link] [comments]
    Can somebody ELI5 Yang's method as discussed in his paper on two-conductor transmission line equations?
    (1 min) I am studying his paper for a college assignment and I am struggling to comprehend what his neural network method entails. Can somebody really dumb it down for me? This is the paper in question: https://aip.scitation.org/doi/full/10.1063/1.5025504 submitted by /u/prowl16 [link] [comments]
  • Reinforcement Learning

    RayEnvWrapper: Vectorize your environment using Ray
    (1 min) I wrote a small environment wrapper to parallelize gym environments using Ray. Not only does it allow to run environments in parallel, but it also permits to run multiple sequential environments on each worker. It could be useful for those of you who, like me, prefer to use Ray than the build multiprocessing and also need to run multiple sequential environments per worker. I needed it for one of my projects, maybe it will help others too. Github: https://github.com/ingambe/RayEnvWrapper submitted by /u/ingambe [link] [comments]
    Continuing task broken into episodes
    (2 min) I want to train an RL agent on a continuing task (there is no start and end), but I can only simulate a fixed amount of steps. Therefore, I need to train the agent simulating several pieces of tragectories. Now, in the common episodic task, I would learn the value function using the target y_t = r_t + gamma * V(s_t+1) and, for the last step of the episode, y_T = r_T. However, in my case, there is no "last step", it's just the step before simulation was stopped. Should I then always use the first bellman eq.? (Think of the example problem of a network node that needs to balance loads between connected servers: it is a continuing process that never ends) submitted by /u/fedetask [link] [comments]
    Reinforcement learning for streaming entity pairs labeling.
    (1 min) Hi Everyone, I am currently working on Entity Matching for Streaming Data. The training dataset is not available in this job. The input stream is a pair of entities, as illustrated in the table below. My objective is to create a model that labels entity pairs as a "match" or "not match". To get the true label for the entity pairs, active learning is used. Name Brand Address entity_1 Apple iPhone 8 plus CA ​ Name Manufacturer Location Price entity_2 iPhone 8p Apple California 699.9 Since I do not have the training data, should I use Reinforcement Learning for this task? If I can use it, can someone please assist me in how to design the RL model to address the above problem? submitted by /u/suhasds2910 [link] [comments]
    Is DQN with experience delay a thing?
    (1 min) Hi! I am currently working on my master thesis on Mobile Edge Computing where I follow a paper called "Deep Reinforcement Learning based Computation Offloading and Resource Allocation for MEC" by Ji Li, Hui Gao, Tiejun Lv, and Yueming Lu (published in IEEE in 2018) that proposes a Deep Reinforcement Learning scheme to find optimal resource allocations. The Deep Reinforcement Learning method they use in the paper is called "Deep Q-learning with Experience Delay". However, I cannot understand where they found this algorithm? I have tried googling around, but with no results and they don't cite where they have found it either. Does anybody have a clue about who might have created this or if it is possible to implement it? https://preview.redd.it/8apiubjr8md81.png?width=1034&format=png&auto=webp&s=625b5dd791799c5d6fc3dcc142c34a50442d210b submitted by /u/Sondreeo [link] [comments]
    Change learning rate of a saved model
    (2 min) Hi there, I playing around with RL for the past few weeks/months and I already have prototype model. The past days I have trained my model for a few hours at a learning rate of 0.0001, and saved the model after training was finished. Now I want to load my model and change the learning rate, so that I can continue my model training but at a different learning rate. Right now I am using stable baselines. As far as I understand it is possible to change the learning rate of a loaded model, by using set_parameters. Currently I am loading the model using 'load()' and I am trying to update the learning rate using 'set_parameters'. According the to docs this should work (if I understand this correctly): Load the model from a zip-file. Warning: load re-creates the model from scratch, it does …
    r/reinforcementlearning Subdirect Statistics
    (0 min) submitted by /u/lochydjango [link] [comments]
    Has anyone tried to train RL with HandManipulatedBlock-v0 from OpenAI Gym?
    (1 min) Update: After searching around, it seems that a successful training on HandManipulatedBlock-v0 env will require a ridiculously large amount of data (3-100 years) according to this blog post from Open AI. https://openai.com/blog/learning-dexterity/ submitted by /u/Rich_Beautiful [link] [comments]

2022-01-23

  • Data Science Central

    Cloud certifications – pros and cons
    (3 min) I read a very interesting article by McKinsey last week Six practical actions for building cloud talent One statement about cloud certifications stand out The most valuable cloud engineers and developers in many established enterprises don’t necessarily have loads of certifications. Instead, they bring extensive experience in IT-infrastructure organizations and have at least five years… Read More »Cloud certifications – pros and cons The post Cloud certifications – pros and cons appeared first on Data Science Central.
  • Artificial Intelligence

    Artificial Nightmares : Godric the Golden || Pytti VQGAN AI Art Video [4K 60 FPS]
    (0 min) submitted by /u/Thenamessd [link] [comments]
    What companies/organizations/websites/communities are there that list and classify AIs?
    (1 min) If possible, I enter a search term such as image processing and then I get all the AIs that have anything to do with image processing. submitted by /u/xXLisa28Xx [link] [comments]
    What companies/organizations/websites/communities are there that list and classify AIs?
    (0 min) If possible, I enter a search term such as image processing and then I get all the AIs that have anything to do with image processing. submitted by /u/xXLisa28Xx [link] [comments]
    What companies/organizations/websites/communities are there that list and classify AIs?
    (0 min) If possible, I enter a search term such as image processing and then I get all the AIs that have anything to do with image processing. submitted by /u/xXLisa28Xx [link] [comments]
    Artificial Nightmares : Lords of Cinder || Pytti VQGAN AI Art Video [4K 60 FPS]
    (0 min) submitted by /u/Thenamessd [link] [comments]
    Deploy ML models using AWS Lambda and Github Actions - A detailed tutorial
    (1 min) I wrote a step-wise tutorial to demonstrate the steps required to deploy an ML model using AWS Lambda, GitHub Actions, AWS API Gateway and using Streamlit to access the model API through a UI. Check out the blog here - https://shreyansh26.github.io/post/2022-01-23_model_deployment_using_aws_lambda/ submitted by /u/shreyansh26 [link] [comments]
    Subtitles in Facebook videos
    (1 min) Iam from a non-English speaking country (Malayalam language ,Kerala, India) and it was hard to find a good generated subtitles that good in videos like in YouTube. But fb videos have been showing really impressive and accurate translated subtitles (Malayalam>English) than any. Even with noicy videos. How could they improve this much, where as YouTube still pretty much bad with even though having more amount of data and information!! ( I don't know about other languages or whether this is just a case with my language!!) submitted by /u/Akulan_ [link] [comments]
    My Keyword Ratings for GAN
    (0 min) submitted by /u/DistributionOk352 [link] [comments]
    Facebook's Data2vec is an AI that can Learn in Multiple Modes
    (0 min) submitted by /u/Beautiful-Credit-868 [link] [comments]
    Optimization of CUDA Elementwise Template Library: Practical, Efficient, and Extensible
    (1 min) submitted by /u/Just0by [link] [comments]
    What AIs are there which are able to render/generate realistic images?
    (0 min) submitted by /u/xXLisa28Xx [link] [comments]
    What games has AI already mastered?
    (1 min) I know AI mastered chess and some computer games, but any card games or something? submitted by /u/TheblackRook3 [link] [comments]
    Artificial Nightmares : God King Darius || Pytti VQGAN AI Art Video [4K 60 FPS]
    (0 min) submitted by /u/Thenamessd [link] [comments]
    Deepmind Researchers Propose ‘ReLICv2’: Pushing The Limits of Self-Supervised ResNets
    (1 min) The supervised learning architectures generally require a massive amount of labeled data. Acquiring this vast amount of high-quality labeled data can turn out to be a very costly and time-consuming task. The main idea behind self-supervised methods in deep learning is to learn the patterns from a given set of unlabelled data and fine-tune the model with few labeled data. Self-supervised learning using residual networks has recently progressed, but they still underperform by a large margin corresponding to supervised residual network models on ImageNet classification benchmarks. This poor performance has rendered the use of self-supervised models in performance-critical scenarios till this point. To address the poor performance of self-supervised models, a team of researchers from DeepMind has recently proposed a new method called RELICv2 to improve the performance of the self-supervised methods and consistently achieve a better performance than baseline supervised learning benchmarks. The proposed RELICv2 is based on the past RELIC framework, which stands for “Representation Learning via Invariant Causal Mechanisms” which uses the principle of invariant prediction for representation learning. In this manner, the RELICv2 (RELIC version 2) model can incorporate RELIC’s capability of solving downstream classification tasks. Continue Reading Paper: https://arxiv.org/pdf/2201.05119.pdf submitted by /u/techsucker [link] [comments]
    Help! Understanding AI using Games
    (1 min) Hello, I am a college student studying Business Informatics from Germany. I have to write a 20-25 page paper on Understanding AI using games due in 6 days. I have no knowledge on AI. I wanted to ask the community, If anybody could like give me a stucture for the Paper, WHAT SHOULD I INCLUDE? WHAT ALGORITHMS? TYPES OF AI? A TABLE OF CONTENT WOULD BE REALLY HELPFUL. Google is flooding me with a lot of information and I am confused. The focus can be laid on any games, simple or complex. All I am asking is for a Table of Content or tips. In summary, I mean all the topics I need to explain in my paper, so that the reader can understand AI via Games I would greatly appreciate your input and experience. Thank you very much 😊 submitted by /u/Far-Virus-999 [link] [comments]
    AI that takes a sets of input as images and produces an output
    (1 min) Hello guys! I am a developer with some years of experience, and as a hobby I develop games with Unity. Still haven't published anything, but I like to try new things and I'm really learning a lot. For a game I'm making I needed to have a long list of weapons. I'm in the stage where I'm simply adding content, like those weapons, creating random gun skins and just setting them stats. But then I recalled some images created by some AI, and realised that I can actually make my game infinite if I find a way to "create" new textures with the help of an AI and just set random stats to them. Obviously this would create some simple images. Luckily I'm developing a game in 2D and simple pixel art, with images that are currently 20 x 20 pixels. What I need is an AI that gets a sets of images as an input, learns, creates a model and then, by request, simply creates a new image as an output, based on what he has seen and a random seed. I've seen that GPT-3 can do something like that, but as you probably guessed it -no, I haven't an API key. And I need to create images without a text input. I've also seen some AIs that do exactly the same thing that I want, but they do it only with faces (learns from a sets of faces and then creates new ones). If you can give me a name of a technology, in any language and preferably open source, it would be wonderful. Thanks! submitted by /u/_WindFall_ [link] [comments]
  • Machine Learning

    [D] A Theoretical Comparison of Gradient and Newton Optimization Methods in Machine Learning
    (3 min) I have the following question on Comparing the Properties of Gradient vs. Newton Optimization Algorithms In the context of function optimization, we are told these kinds of general characteristics about both types of algorithms: Gradient Based Optimization Algorithms: Gradient Based Optimization Algorithms (e.g. Gradient Descent) try to find the minimum of a function by repeatedly evaluating the first derivatives of the function. Evaluating the first derivatives of this function leads the optimization algorithm to the next point to consider as the possible minimum - this first derivative evaluation based search for the minimum of the function continues until some convergence criteria is achieved (e.g. difference in successive searches are less than some threshold). Newton Based Optimiza…
    [D] Machine Learning Competencies by Role and Experience
    (1 min) I work in an entry level ML position and I'm looking to get insight regarding key competencies that an ML Engineer should have at various stages of their career. For example, given the roles: Associate ML Engineer, ML Engineer, Senior ML Engineer, Staff ML Engineer, and Principal ML Engineer what frameworks / models would you expect them to have worked with and to what extent? I understand that there will be bias in these responses based upon individual experience and exposure, this is part of why I feel this topic is important. At some level there should be some common agreeances on exposure / competencies and this is more or less what I'm looking for. I'm currently making a learning map where I hope to outline what I know / have exposure to / don't know and then compare my knowledge gaps with a basic benchmark for each of the aforementioned roles and use that to help guide me in my independent studies. Given that I'm single and recently finished my degree I feel that it's very reasonable for me to spend 20 hours/ month of my time filling my knowledge gaps and making sure I cover various key competencies before I look to specialize deeper. Though this post is somewhat career / learning related, I hope it isn't removed as the potential discussion here would likely be great for the community submitted by /u/theRealDavidDavis [link] [comments]
    [R] Co-First Author in CS/ML publication.
    (1 min) Hi What is the view of Co-First authorship in the CS/ML empirical publication? Is the 2nd Co-First Author being viewed as equal weight as the 1st Co-First Author? I heard people are saying "first author or nothing" in the ML community. How does this apply to Co-First authorship? submitted by /u/randy_wales_qq [link] [comments]
    [D] Why are there not much labs that focus on reinforcement learning?
    (2 min) Even in MIT, UCB, Stanford, and CMU, only a handful of labs focus on RL. For example, in Berkeley it seems Abbeel and Levine's labs are the only labs that focus on RL. Outside of these schools and UMass, Brown, Harvard, and UT Austin, there are not many labs that focus on RL. Are there professors focusing on RL in schools like Cornell, Columbia, Georgia tech, UMD, UMich, etc? Also, why do you think there are not many people working on RL? submitted by /u/MysteriousQuantity97 [link] [comments]
    [D] Preprocessing of Wikipedia Dumps for Language Modeling from Scratch
    (1 min) I want to train a language model from scratch on wikipedia dumps of a language, say French. I download the dumps and extract them using the wikiextractor tool. I lower-case everything but keep all the accents, since they are important for French. So far so good, but now it gets blurry. There is very little information about the specifics of preprocessing people are applying to the dumps before training tokenizers and feeding the data into the model. How are section headers etc. removed from the dump (or are they kept in)? How is a wikipedia article split into sequences (i.e. individual samples)? Especially: how do you avoid very short sequences (that need lots of padding) and very long sequences (that will be truncated)? What kind of preprocessing / normalization are people applying? Unicode normalization (NFC?) Moses (pre-)tokenizer? What if I'm using the RoBERTa tokenizer that expects "raw" input data? I hope that some of the practitioners here might be able to share their experiences. submitted by /u/optimized-adam [link] [comments]
    [D] Podcast episodes about the environmental impact of ML
    (0 min) Please share your favorites podcast episodes/talks/debates related to the environmental impact of machine learning. It doesn't matter which platform the podcast/talk is on (can be YouTube, Soundcloud etc). I'll be collecting the episodes that are available on Spotify in this playlist. Sustainable data science podcast episodes submitted by /u/ConfidentAbility [link] [comments]
    [D] When are MCMC methods used for the BART model?
    (1 min) I am studying the BART (Bayesian Additive Regression Trees) model from the following articles: Chipman 2010, Chipman 2007 and Chipman 1998 (https://arxiv.org/pdf/0806.3286.pdf). At the moment I understand this from the model for classification: Prior distribution in trees P(T): P(split) =α(1+d)^-β Prior distribution in terminal nodes P(µ): µ ~ N (0,σ_{µ}^2) Prior distribution for P(σ): Inverse Chi (it's = 1 for classification problems) I know that MCMC looks for values for parameters. In the article they recommend values for α, β and for k and m in σ^2 Question 1: Was MCMC used to calculate those values? If so, How was the likelihood (contains the data)calculated? Question 2: In case the posterior distribution has already been calculated with MCMC, is the posterior predictive distribution useful or is it enough to calculate the parameters? submitted by /u/Garluz [link] [comments]
    [R] Unifying all Machine Learning Frameworks - Link to a free online lecture by the author in comments
    (3 min) submitted by /u/pinter69 [link] [comments]
    [Project] Stacking arbitrary 2D/3D parts together while maximizing the density
    (1 min) Hi, I have a problem where I need to stack 3D parts together. It can be simplified in 2D if needed but 3D would still be great. The parts have different shapes. To start with, I will begin with all the parts having the same shape. A L shaped Tetris block. The goal is simple: I need to stack the parts in a way that the density will be as high as possible. Of course I cannot make a deterministic algorithm because this has to work for all sort of parts shapes. I'm looking forward in these directions: Agent based modeling Reinforcement learning Genetic/Evolutionary algorithms 2D/3D physics simulation frameworks (just needing collisions) Regarding agent based modeling, I found Mesa with Python, but it seems to be pretty basic. Agent.jl in Julia seems to provide a wider set of tools. Agents.jl has collision, at least for 2D circles. I'm not sure it has it for complex shapes. In NetLogo, I found particles collision only. I found tons of Genetic and Reinforcement learning libraries in Python and Julia but nothing related to stacking 2D/3D shapes. Regarding physics I found Pymunk in 2D. In 3D, Pandas 3D. Blender3D also has a Python API. It is a complex problem so I know that I will not find the "ideal" tool. I would appreciate if you can give me some ideas of tools and algorithms I shall explore. Thanks submitted by /u/x11ry0 [link] [comments]
    [D] Suite of tabular datasets for supervised learning algorithm benchmark
    (1 min) I would like to have a suite of public datasets that I can run my experiments against. Ideally, the suite should be accepted by the community to limit suspicion of cherry-picking datasets. Also, I want X,y matrices directly and not raw data that requires preprocessing. In the past, I've used datasets from the UCI repository but it has been a few years back. Is it still the go-to dataset repository or is there something more recent? submitted by /u/Scea91 [link] [comments]
    Classification of Higgs boson decays using machine learning with Python [R]
    (1 min) Hi everybody, I want to present you a project I developed some months ago using Python and machine learning within particle physics topics. Topic: study of the Higgs boson Yukawa coupling to tau leptons using the 2012 ATLAS Run-2 dataset. Particular focus is dedicated to the usage of machine learning classification algorithms to classify the Higgs decay channel H to tautau as signal with respect to the other background processes. For the classification have been considered the cases in which there are 0,1 or 2 jets in the final state. Data was taken from the Higgs challenge kaggle dataset. Let me know what do you think and don't forget to leave a star if you like it! Github: repository: higgs-decay-classification submitted by /u/biancogianluca9 [link] [comments]
    [Discussion] Plagiarised review in top conference
    (5 min) I'm an undergrad and a prof that I am working with asked me to write a review for a paper that was under a double-blind review for a top ML conference -- without any mention of why (I have written paper reviews for them before). Initially I thought that the review was for a paper of their own, and that they wanted some outside perspective, until, I found my review word-for-word posted over on OpenReview (I found out about this only recently, months after the review had been up on OpenReview) The only person who had access to my work, was my professor, and they posted the same without my explicit consent and any prior knowledge. Post me submitting my review to them, they also had a dialgoue with me about the paper, and were essentially seeking my opinion on critiques posted by other reviewers on the respective OpenReview thread by copy-pasting points made in the thread and passing them off as their own. (At that point I still did not have any knowledge that my work was pushed to OpenReview) Is this typical behaviour? How alarmed should I be about this, and what should I do? I hesitate to take any confrontational action since we have been working on research for a while, and already have plans for more down the line. submitted by /u/Ok-Assignment-6621 [link] [comments]
    [D] How does personalization work for semantic search when it is sourced through a thrid party? E.g. Pinecone
    (1 min) There are a few companies offering semantic search api for their customers, I was wondering how is personalization possible in such scenarios? Does the customer have to also push session info to the api to get relevant results back? It might be possible for the customer to re-rank after initial results and match it to the user but I am just interested to learn how personalization can be applied to such a scenario? submitted by /u/notsoserious408 [link] [comments]
    [D] In your opinion, what are some common trends across papers with 'Outstanding Paper' awards?
    (3 min) ^^ Like all awards, I know that the 'Outstanding Paper' distinction can be very arbitrary, particularly since the caliber of accepted research papers in ML conferences tends to be high. I'm starting out in ML as an undergrad and have read a few award-winning papers that were really good, but I can't discern them from some other non-awarded papers, which were also equally impressive. How do committees choose best papers? Are papers chosen based on factors such as novelty, impact, experiment thoroughness, theoretical and empirical contribution? Do 'negative results' rarely, if ever make it into consideration? Do 'best papers' accrue more citations because they've received such a designation, or because they are inherently well-formed in the first place? Do 'best papers' usually come from top research groups (Deepmind, FAIR) or universities, with a lot of resources to run computationally heavy experiments? Looking forward to your thoughts, and I apologize in advance if some of my wording is off. Thanks! submitted by /u/akardashian [link] [comments]
  • Neural Networks, Deep Learning and Machine Learning

    How to train a neutral network for regression if some of the output data points are way bigger than the others?
    (2 min) First of all, I would like to thank the members of this sub for help with me previous post. I had been trying to train a neutral network for regression and the input I recieved here has made a lot of difference. My network now has a mean absolute error of 0.005 and a mean absolute percentage error of 28%. My only problem is that there's one data point in my output that's an order of magnitude bigger larger than the others and my network struggles to produce that point. How do I go ahead? Any ideas? I thought of using a separate network for points which are bigger than the others, but I feel like there aren't enough points for the training of such a network to be viable. submitted by /u/eva01beast [link] [comments]
  • Reinforcement Learning

    Using RL to understand the Brain
    (2 min) I am trying to use reinforcement learning to understand how the brain works. My dataset is the neural activity pattern generated by a subject and the resulting behavior from that neural activity pattern. The relationship between neural activity and behavior is known to the experimenters. The subject is trying to modulate his neural activity to move a computer cursor and hit a target. Hitting the target is the goal for the subject and thus could be considered getting rewarded. While we observe the neural activity pattern produced, it is unknown what neural activity pattern was *intended* to be produced. If we consider the subject as the agent, then the action would be trying to producing a neural activity pattern. In this scenario, we don't know the action the agent took (the desired neural activity pattern), only the result of that action (the observed neural activity pattern). Furthermore, we don't know the agents policy or the dynamics of the environment (in this case, dynamics of the environment would be the dynamics/noise within the brain). In summary: agent = the subject action = desired neural activity (unknown) environment = the noise and neural dynamics of the brain (unknown) policy = what neural activity patterns to make and when to make them (unknown and changing as the subject learns) result from actions = the neural activity produced (known) Would there be any way of using reinforcement learning to learn (at least a subset) of these unknown parameters? Would traditional inverse reinforcement learning be appropriate here? Thanks! submitted by /u/sdrinz [link] [comments]
    Quick Markov Reward Process question
    (2 min) So I'm trying to calculate the state value in an example I came up with. It's very simple, but I'm confused how to do it because of the v(s) formula: https://imgur.com/a/eJJ27Bw If my understanding is correct, Rs can be interpreted as the reward from entering state s, but also the reward from leaving state s, right? In the case of being the reward from leaving state s, how do you calculate it? In the formula it shows Rs and the transition probabilities are only interacting with v(s'). Intuitively, I'm thinking of calculating it like this: Rs = 1*0.9 -1*0.1 = 0.8 v(State 1) = 0.8 + gamma*(0.9*v(State 2) + 0.1*v(State 3)) But the formula simply shows Rs and not the probabilities, so I'm not sure. submitted by /u/pedrohpf [link] [comments]
    Windows or Linux distro for RL?
    (1 min) submitted by /u/ZazieIsInTheHouse [link] [comments]
    How can you do RL on a graph?
    (2 min) I am new to stable-baselines3 and am trying to get a toy problem to work. I previously had a bit flipping example using an array and now I would like to do the same thing using a graph with node weights instead. I am not sure how to do this. The following code snippet will make a linear graph with 10 nodes, add node weights to each node and convert it to a dgl graph import networkx as nx import random import dgl # Create edges to add edges = [] N = 10 for i in range(N-1): edges.append((i, i+1)) # Create graph and convert it into a dgl graph G=nx.DiGraph() G.add_edges_from(edges) for i in range(len(G.nodes)): G.nodes[i]['weight'] = random.choice([0,1]) dgl_graph = dgl.from_networkx(G, node_attrs=["weight"]) When I was using a linear array for the bit flipping example my environment was …
    What framework would you recommend to build a Tetris game AI using reinforcement learning?
    (1 min) Dear, I'm looking forward to building a sort of Tetris game AI. To start with, with just one shape, then I will complexify. For example, I need to pile up L-shaped parts in such a way that they maximize the density. The difference with the classical Tetris game is that the parts are not constrained in their orientation. It can be any angle from 0° to 360°. I don't really know what framework I can use. I had a look at NetLogo but I'm unsure it has the capacity to build L-shaped agents, handle stacking, or measure density. I has a look to Julia too. There are nice tools build by JuliaDynamics. I.e. Agents.jl for agent based modeling. It handles collisions. There is also a framework for reinforcement learning. Also for Genetic Algorithms. Then I found a set of libraries related to Geometry. But it seems to be a lot of work to put that together for my use case. With Python, I found Mesa for agents but it does not seems to have the capacity either and seems quite limited compared to agents.jl. Do you have some ideas to start with? Thanks submitted by /u/x11ry0 [link] [comments]
    Are there any follow-up studies of RL^2 algorithms?
    (1 min) Hi r/reinforcementlearning! I recently started to be interested in meta-reinforcement learning, and I am particularly interested in models using recurrent neural networks such as RL2. But after few search I found that most of the recent approach for meta-reinforcement learning is based on MAML method, Although RL2 performed very well in meta rl benchmark paper, meta-world. And it was hard to find follow-up research of RL2 at the same time. Does anyone knows about follow-up researches of RL2? Thanks! submitted by /u/jinPrelude [link] [comments]
    Slow learning even though the reward is shaped
    (1 min) Hello, I have an RL agent which has to move to a certain target location. Currently, the area is 1000x1000 units and agent moves at 100 units/step (in any direction). At each step, the agent gets a positive reward for moving towards the target, and a negative reward otherwise (hence the reward is shaped; the agent does not get anything more at the end of the episode). This works perfectly fine, and the agent learns to go to the target. However, once I reduce the speed of the agent (say 30), the learning becomes considerably slower. I'm using PPO, and the hyperparameters are unchanged from the 100 speed scenario. What could cause this problem? episode length? batch size? any insight is appreciated :D submitted by /u/AhmedNizam_ [link] [comments]

2022-01-22

  • cs.LG updates on arXiv.org

    Taurus: A Data Plane Architecture for Per-Packet ML. (arXiv:2002.08987v2 [cs.NI] UPDATED)
    (0 min) Emerging applications -- cloud computing, the internet of things, and augmented/virtual reality -- demand responsive, secure, and scalable datacenter networks. These networks currently implement simple, per-packet, data-plane heuristics (e.g., ECMP and sketches) under a slow, millisecond-latency control plane that runs data-driven performance and security policies. However, to meet applications' service-level objectives (SLOs) in a modern data center, networks must bridge the gap between line-rate, per-packet execution and complex decision making. In this work, we present the design and implementation of Taurus, a data plane for line-rate inference. Taurus adds custom hardware based on a flexible, parallel-patterns (MapReduce) abstraction to programmable network devices, such as switches and NICs; this new hardware uses pipelined SIMD parallelism to enable per-packet MapReduce operations (e.g., inference). Our evaluation of a Taurus switch ASIC -- supporting several real-world models -- shows that Taurus operates orders of magnitude faster than a server-based control plane while increasing area by 3.8% and latency for line-rate ML models by up to 221 ns. Furthermore, our Taurus FPGA prototype achieves full model accuracy and detects two orders of magnitude more events than a state-of-the-art control-plane anomaly-detection system.
    Towards Energy Efficient Distributed Federated Learning for 6G Networks. (arXiv:2201.08270v1 [cs.LG])
    (0 min) The provision of communication services via portable and mobile devices, such as aerial base stations, is a crucial concept to be realized in 5G/6G networks. Conventionally, IoT/edge devices need to transmit the data directly to the base station for training the model using machine learning techniques. The data transmission introduces privacy issues that might lead to security concerns and monetary losses. Recently, Federated learning was proposed to partially solve privacy issues via model-sharing with base station. However, the centralized nature of federated learning only allow the devices within the vicinity of base stations to share the trained models. Furthermore, the long-range communication compels the devices to increase transmission power, which raises the energy efficiency concerns. In this work, we propose distributed federated learning (DBFL) framework that overcomes the connectivity and energy efficiency issues for distant devices. The DBFL framework is compatible with mobile edge computing architecture that connects the devices in a distributed manner using clustering protocols. Experimental results show that the framework increases the classification performance by 7.4\% in comparison to conventional federated learning while reducing the energy consumption.
    Dimensionality reduction to maximize prediction generalization capability. (arXiv:2003.00470v2 [stat.ML] UPDATED)
    (0 min) Generalization of time series prediction remains an important open issue in machine learning, wherein earlier methods have either large generalization error or local minima. We develop an analytically solvable, unsupervised learning scheme that extracts the most informative components for predicting future inputs, termed predictive principal component analysis (PredPCA). Our scheme can effectively remove unpredictable noise and minimize test prediction error through convex optimization. Mathematical analyses demonstrate that, provided with sufficient training samples and sufficiently high-dimensional observations, PredPCA can asymptotically identify hidden states, system parameters, and dimensionalities of canonical nonlinear generative processes, with a global convergence guarantee. We demonstrate the performance of PredPCA using sequential visual inputs comprising hand-digits, rotating 3D objects, and natural scenes. It reliably estimates distinct hidden states and predicts future outcomes of previously unseen test input data, based exclusively on noisy observations. The simple architecture and low computational cost of PredPCA are highly desirable for neuromorphic hardware.
    Spectral Robustness for Correlation Clustering Reconstruction in Semi-Adversarial Models. (arXiv:2108.04729v2 [cs.DS] UPDATED)
    (0 min) Correlation Clustering is an important clustering problem with many applications. We study the reconstruction version of this problem in which one is seeking to reconstruct a latent clustering that has been corrupted by random noise and adversarial modifications. Concerning the latter, there is a standard "post-adversarial" model in the literature, in which adversarial modifications come after the noise. Here, we introduce and analyse a "pre-adversarial" model in which adversarial modifications come before the noise. Given an input coming from such a semi-adversarial generative model, the goal is to reconstruct almost perfectly and with high probability the latent clustering. We focus on the case where the hidden clusters have nearly equal size and show the following. In the pre-adversarial setting, spectral algorithms are optimal, in the sense that they reconstruct all the way to the information-theoretic threshold beyond which no reconstruction is possible. This is in contrast to the post-adversarial setting, in which their ability to restore the hidden clusters stops before the threshold, but the gap is optimally filled by SDP-based algorithms. These results highlight a heretofore unknown robustness of spectral algorithms, showing them less brittle than previously thought.
    Lead-lag detection and network clustering for multivariate time series with an application to the US equity market. (arXiv:2201.08283v1 [stat.ML])
    (0 min) In multivariate time series systems, it has been observed that certain groups of variables partially lead the evolution of the system, while other variables follow this evolution with a time delay; the result is a lead-lag structure amongst the time series variables. In this paper, we propose a method for the detection of lead-lag clusters of time series in multivariate systems. We demonstrate that the web of pairwise lead-lag relationships between time series can be helpfully construed as a directed network, for which there exist suitable algorithms for the detection of pairs of lead-lag clusters with high pairwise imbalance. Within our framework, we consider a number of choices for the pairwise lead-lag metric and directed network clustering components. Our framework is validated on both a synthetic generative model for multivariate lead-lag time series systems and daily real-world US equity prices data. We showcase that our method is able to detect statistically significant lead-lag clusters in the US equity market. We study the nature of these clusters in the context of the empirical finance literature on lead-lag relations and demonstrate how these can be used for the construction of predictive financial signals.
    Spectral embedding for dynamic networks with stability guarantees. (arXiv:2106.01282v2 [stat.ML] UPDATED)
    (0 min) We consider the problem of embedding a dynamic network, to obtain time-evolving vector representations of each node, which can then be used to describe changes in behaviour of individual nodes, communities, or the entire graph. Given this open-ended remit, we argue that two types of stability in the spatio-temporal positioning of nodes are desirable: to assign the same position, up to noise, to nodes behaving similarly at a given time (cross-sectional stability) and a constant position, up to noise, to a single node behaving similarly across different times (longitudinal stability). Similarity in behaviour is defined formally using notions of exchangeability under a dynamic latent position network model. By showing how this model can be recast as a multilayer random dot product graph, we demonstrate that unfolded adjacency spectral embedding satisfies both stability conditions. We also show how two alternative methods, omnibus and independent spectral embedding, alternately lack one or the other form of stability.
    Statistical prediction of extreme events from small datasets. (arXiv:2201.08294v1 [physics.flu-dyn])
    (0 min) We propose Echo State Networks (ESNs) to predict the statistics of extreme events in a turbulent flow. We train the ESNs on small datasets that lack information about the extreme events. We asses whether the networks are able to extrapolate from the small imperfect datasets and predict the heavy-tail statistics that describe the events. We find that the networks correctly predict the events and improve the statistics of the system with respect to the training data in almost all cases analysed. This opens up new possibilities for the statistical prediction of extreme events in turbulence.
    Cross DQN: Cross Deep Q Network for Ads Allocation in Feed. (arXiv:2109.04353v3 [cs.LG] UPDATED)
    (2 min) E-commerce platforms usually display a mixed list of ads and organic items in feed. One key problem is to allocate the limited slots in the feed to maximize the overall revenue as well as improve user experience, which requires a good model for user preference. Instead of modeling the influence of individual items on user behaviors, the arrangement signal models the influence of the arrangement of items and may lead to a better allocation strategy. However, most of previous strategies fail to model such a signal and therefore result in suboptimal performance. In addition, the percentage of ads exposed (PAE) is an important indicator in ads allocation. Excessive PAE hurts user experience while too low PAE reduces platform revenue. Therefore, how to constrain the PAE within a certain range while keeping personalized recommendation under the PAE constraint is a challenge. In this paper, we propose Cross Deep Q Network (Cross DQN) to extract the crucial arrangement signal by crossing the embeddings of different items and modeling the crossed sequence by multi-channel attention. Besides, we propose an auxiliary loss for batch-level constraint on PAE to tackle the above-mentioned challenge. Our model results in higher revenue and better user experience than state-of-the-art baselines in offline experiments. Moreover, our model demonstrates a significant improvement in the online A/B test and has been fully deployed on Meituan feed to serve more than 300 millions of customers.
    Shared Visual Representations of Drawing for Communication: How do different biases affect human interpretability and intent?. (arXiv:2110.08203v2 [cs.LG] UPDATED)
    (2 min) We present an investigation into how representational losses can affect the drawings produced by artificial agents playing a communication game. Building upon recent advances, we show that a combination of powerful pretrained encoder networks, with appropriate inductive biases, can lead to agents that draw recognisable sketches, whilst still communicating well. Further, we start to develop an approach to help automatically analyse the semantic content being conveyed by a sketch and demonstrate that current approaches to inducing perceptual biases lead to a notion of objectness being a key feature despite the agent training being self-supervised.
    Fixing the train-test resolution discrepancy. (arXiv:1906.06423v4 [cs.CV] UPDATED)
    (2 min) Data-augmentation is key to the training of neural networks for image classification. This paper first shows that existing augmentations induce a significant discrepancy between the typical size of the objects seen by the classifier at train and test time. We experimentally validate that, for a target test resolution, using a lower train resolution offers better classification at test time. We then propose a simple yet effective and efficient strategy to optimize the classifier performance when the train and test resolutions differ. It involves only a computationally cheap fine-tuning of the network at the test resolution. This enables training strong classifiers using small training images. For instance, we obtain 77.1% top-1 accuracy on ImageNet with a ResNet-50 trained on 128x128 images, and 79.8% with one trained on 224x224 image. In addition, if we use extra training data we get 82.5% with the ResNet-50 train with 224x224 images. Conversely, when training a ResNeXt-101 32x48d pre-trained in weakly-supervised fashion on 940 million public images at resolution 224x224 and further optimizing for test resolution 320x320, we obtain a test top-1 accuracy of 86.4% (top-5: 98.0%) (single-crop). To the best of our knowledge this is the highest ImageNet single-crop, top-1 and top-5 accuracy to date.
    BINAS: Bilinear Interpretable Neural Architecture Search. (arXiv:2110.12399v2 [cs.LG] UPDATED)
    (2 min) Practical use of neural networks often involves requirements on latency, energy and memory among others. A popular approach to find networks under such requirements is through constrained Neural Architecture Search (NAS). However, previous methods use complicated predictors for the accuracy of the network. Those predictors are hard to interpret and sensitive to many hyperparameters to be tuned, hence, the resulting accuracy of the generated models is often harmed. In this work we resolve this by introducing Bilinear Interpretable Neural Architecture Search (BINAS), that is based on an accurate and simple bilinear formulation of both an accuracy estimator and the expected resource requirement, together with a scalable search method with theoretical guarantees. The simplicity of our proposed estimator together with the intuitive way it is constructed bring interpretability through many insights about the contribution of different design choices. For example, we find that in the examined search space, adding depth and width is more effective at deeper stages of the network and at the beginning of each resolution stage. Our experiments show that BINAS generates comparable to or better architectures than other state-of-the-art NAS methods within a reduced marginal search cost, while strictly satisfying the resource constraints.
    Direct Behavior Specification via Constrained Reinforcement Learning. (arXiv:2112.12228v2 [cs.LG] UPDATED)
    (0 min) The standard formulation of Reinforcement Learning lacks a practical way of specifying what are admissible and forbidden behaviors. Most often, practitioners go about the task of behavior specification by manually engineering the reward function, a counter-intuitive process that requires several iterations and is prone to reward hacking by the agent. In this work, we argue that constrained RL, which has almost exclusively been used for safe RL, also has the potential to significantly reduce the amount of work spent for reward specification in applied RL projects. To this end, we propose to specify behavioral preferences in the CMDP framework and to use Lagrangian methods to automatically weigh each of these behavioral constraints. Specifically, we investigate how CMDPs can be adapted to solve goal-based tasks while adhering to several constraints simultaneously. We evaluate this framework on a set of continuous control tasks relevant to the application of Reinforcement Learning for NPC design in video games.
    Conditional Generation of Medical Time Series for Extrapolation to Underrepresented Populations. (arXiv:2201.08186v1 [cs.LG])
    (2 min) The widespread adoption of electronic health records (EHRs) and subsequent increased availability of longitudinal healthcare data has led to significant advances in our understanding of health and disease with direct and immediate impact on the development of new diagnostics and therapeutic treatment options. However, access to EHRs is often restricted due to their perceived sensitive nature and associated legal concerns, and the cohorts therein typically are those seen at a specific hospital or network of hospitals and therefore not representative of the wider population of patients. Here, we present HealthGen, a new approach for the conditional generation of synthetic EHRs that maintains an accurate representation of real patient characteristics, temporal information and missingness patterns. We demonstrate experimentally that HealthGen generates synthetic cohorts that are significantly more faithful to real patient EHRs than the current state-of-the-art, and that augmenting real data sets with conditionally generated cohorts of underrepresented subpopulations of patients can significantly enhance the generalisability of models derived from these data sets to different patient populations. Synthetic conditionally generated EHRs could help increase the accessibility of longitudinal healthcare data sets and improve the generalisability of inferences made from these data sets to underrepresented populations.
    Prediction of the electron density of states for crystalline compounds with Atomistic Line Graph Neural Networks (ALIGNN). (arXiv:2201.08348v1 [cond-mat.mtrl-sci])
    (2 min) Machine learning (ML) based models have greatly enhanced the traditional materials discovery and design pipeline. Specifically, in recent years, surrogate ML models for material property prediction have demonstrated success in predicting discrete scalar-valued target properties to within reasonable accuracy of their DFT-computed values. However, accurate prediction of spectral targets such as the electron Density of States (DOS) poses a much more challenging problem due to the complexity of the target, and the limited amount of available training data. In this study, we present an extension of the recently developed Atomistic Line Graph Neural Network (ALIGNN) to accurately predict DOS of a large set of material unit cell structures, trained to the publicly available JARVIS-DFT dataset. Furthermore, we evaluate two methods of representation of the target quantity - a direct discretized spectrum, and a compressed low-dimensional representation obtained using an autoencoder. Through this work, we demonstrate the utility of graph-based featurization and modeling methods in the prediction of complex targets that depend on both chemistry and directional characteristics of material structures.
    Clustering with Semidefinite Programming and Fixed Point Iteration. (arXiv:2012.09202v2 [math.OC] UPDATED)
    (2 min) We introduce a novel method for clustering using a semidefinite programming (SDP) relaxation of the Max k-Cut problem. The approach is based on a new methodology for rounding the solution of an SDP relaxation using iterated linear optimization. We show the vertices of the Max k-Cut SDP relaxation correspond to partitions of the data into at most k sets. We also show the vertices are attractive fixed points of iterated linear optimization. Each step of this iterative procedure solves a relaxation of the closest vertex problem and leads to a new clustering problem where the underlying clusters are more clearly defined. Our experiments show that using fixed point iteration for rounding the Max k-Cut SDP relaxation leads to significantly better results when compared to randomized rounding.
    WPPNets: Unsupervised CNN Training with Wasserstein Patch Priors for Image Superresolution. (arXiv:2201.08157v1 [cs.CV])
    (0 min) We introduce WPPNets, which are CNNs trained by a new unsupervised loss function for image superresolution of materials microstructures. Instead of requiring access to a large database of registered high- and low-resolution images, we only assume to know a large database of low resolution images, the forward operator and one high-resolution reference image. Then, we propose a loss function based on the Wasserstein patch prior which measures the Wasserstein-2 distance between the patch distributions of the predictions and the reference image. We demonstrate by numerical examples that WPPNets outperform other methods with similar assumptions. In particular, we show that WPPNets are much more stable under inaccurate knowledge or perturbations of the forward operator. This enables us to use them in real-world applications, where neither a large database of registered data nor the exact forward operator are given.
    From Psychological Curiosity to Artificial Curiosity: Curiosity-Driven Learning in Artificial Intelligence Tasks. (arXiv:2201.08300v1 [cs.AI])
    (2 min) Psychological curiosity plays a significant role in human intelligence to enhance learning through exploration and information acquisition. In the Artificial Intelligence (AI) community, artificial curiosity provides a natural intrinsic motivation for efficient learning as inspired by human cognitive development; meanwhile, it can bridge the existing gap between AI research and practical application scenarios, such as overfitting, poor generalization, limited training samples, high computational cost, etc. As a result, curiosity-driven learning (CDL) has become increasingly popular, where agents are self-motivated to learn novel knowledge. In this paper, we first present a comprehensive review on the psychological study of curiosity and summarize a unified framework for quantifying curiosity as well as its arousal mechanism. Based on the psychological principle, we further survey the literature of existing CDL methods in the fields of Reinforcement Learning, Recommendation, and Classification, where both advantages and disadvantages as well as future work are discussed. As a result, this work provides fruitful insights for future CDL research and yield possible directions for further improvement.
    Landscape of Neural Architecture Search across sensors: how much do they differ ?. (arXiv:2201.06321v2 [cs.LG] UPDATED)
    (2 min) With the rapid rise of neural architecture search, the ability to understand its complexity from the perspective of a search algorithm is desirable. Recently, Traor\'e et al. have proposed the framework of Fitness Landscape Footprint to help describe and compare neural architecture search problems. It attempts at describing why a search strategy might be successful, struggle or fail on a target task. Our study leverages this methodology in the context of searching across sensors, including sensor data fusion. In particular, we apply the Fitness Landscape Footprint to the real-world image classification problem of So2Sat LCZ42, in order to identify the most beneficial sensor to our neural network hyper-parameter optimization problem. From the perspective of distributions of fitness, our findings indicate a similar behaviour of the search space for all sensors: the longer the training time, the larger the overall fitness, and more flatness in the landscapes (less ruggedness and deviation). Regarding sensors, the better the fitness they enable (Sentinel-2), the better the search trajectories (smoother, higher persistence). Results also indicate very similar search behaviour for sensors that can be decently fitted by the search space (Sentinel-2 and fusion).
    BSODA: A Bipartite Scalable Framework for Online Disease Diagnosis. (arXiv:2012.01065v3 [cs.LG] UPDATED)
    (2 min) A growing number of people are seeking healthcare advice online. Usually, they diagnose their medical conditions based on the symptoms they are experiencing, which is also known as self-diagnosis. From the machine learning perspective, online disease diagnosis is a sequential feature (symptom) selection and classification problem. Reinforcement learning (RL) methods are the standard approaches to this type of tasks. Generally, they perform well when the feature space is small, but frequently become inefficient in tasks with a large number of features, such as the self-diagnosis. To address the challenge, we propose a non-RL Bipartite Scalable framework for Online Disease diAgnosis, called BSODA. BSODA is composed of two cooperative branches that handle symptom-inquiry and disease-diagnosis, respectively. The inquiry branch determines which symptom to collect next by an information-theoretic reward. We employ a Product-of-Experts encoder to significantly improve the handling of partial observations of a large number of features. Besides, we propose several approximation methods to substantially reduce the computational cost of the reward to a level that is acceptable for online services. Additionally, we leverage the diagnosis model to estimate the reward more precisely. For the diagnosis branch, we use a knowledge-guided self-attention model to perform predictions. In particular, BSODA determines when to stop inquiry and output predictions using both the inquiry and diagnosis models. We demonstrate that BSODA outperforms the state-of-the-art methods on several public datasets. Moreover, we propose a novel evaluation method to test the transferability of symptom checking methods from synthetic to real-world tasks. Compared to existing RL baselines, BSODA is more effectively scalable to large search spaces.
    Scalable Graph Embedding LearningOn A Single GPU. (arXiv:2110.06991v2 [cs.LG] UPDATED)
    (2 min) Graph embedding techniques have attracted growing interest since they convert the graph data into continuous and low-dimensional space. Effective graph analytic provides users a deeper understanding of what is behind the data and thus can benefit a variety of machine learning tasks. With the current scale of real-world applications, most graph analytic methods suffer high computation and space costs. These methods and systems can process a network with thousands to a few million nodes. However, scaling to large-scale networks remains a challenge. The complexity of training graph embedding system requires the use of existing accelerators such as GPU. In this paper, we introduce a hybrid CPU-GPU framework that addresses the challenges of learning embedding of large-scale graphs. The performance of our method is compared qualitatively and quantitatively with the existing embedding systems on common benchmarks. We also show that our system can scale training to datasets with an order of magnitude greater than a single machine's total memory capacity. The effectiveness of the learned embedding is evaluated within multiple downstream applications. The experimental results indicate the effectiveness of the learned embedding in terms of performance and accuracy.
    Improving the quality control of seismic data through active learning. (arXiv:2201.06616v2 [stat.ML] UPDATED)
    (2 min) In image denoising problems, the increasing density of available images makes an exhaustive visual inspection impossible and therefore automated methods based on machine-learning must be deployed for this purpose. This is particulary the case in seismic signal processing. Engineers/geophysicists have to deal with millions of seismic time series. Finding the sub-surface properties useful for the oil industry may take up to a year and is very costly in terms of computing/human resources. In particular, the data must go through different steps of noise attenuation. Each denoise step is then ideally followed by a quality control (QC) stage performed by means of human expertise. To learn a quality control classifier in a supervised manner, labeled training data must be available, but collecting the labels from human experts is extremely time-consuming. We therefore propose a novel active learning methodology to sequentially select the most relevant data, which are then given back to a human expert for labeling. Beyond the application in geophysics, the technique we promote in this paper, based on estimates of the local error and its uncertainty, is generic. Its performance is supported by strong empirical evidence, as illustrated by the numerical experiments presented in this article, where it is compared to alternative active learning strategies both on synthetic and real seismic datasets.
    Kernel Methods and Multi-layer Perceptrons Learn Linear Models in High Dimensions. (arXiv:2201.08082v1 [stat.ML])
    (2 min) Empirical observation of high dimensional phenomena, such as the double descent behaviour, has attracted a lot of interest in understanding classical techniques such as kernel methods, and their implications to explain generalization properties of neural networks. Many recent works analyze such models in a certain high-dimensional regime where the covariates are independent and the number of samples and the number of covariates grow at a fixed ratio (i.e. proportional asymptotics). In this work we show that for a large class of kernels, including the neural tangent kernel of fully connected networks, kernel methods can only perform as well as linear models in this regime. More surprisingly, when the data is generated by a kernel model where the relationship between input and the response could be very nonlinear, we show that linear models are in fact optimal, i.e. linear models achieve the minimum risk among all models, linear or nonlinear. These results suggest that more complex models for the data other than independent features are needed for high-dimensional analysis.
    Speeding up Learning Quantum States through Group Equivariant Convolutional Quantum Ans\"atze. (arXiv:2112.07611v2 [quant-ph] UPDATED)
    (2 min) We develop a theoretical framework for $S_n$-equivariant quantum convolutional circuits, building on and significantly generalizing Jordan's Permutational Quantum Computing (PQC) formalism. We show that quantum circuits are a natural choice for Fourier space neural architectures affording a super-exponential speedup in computing the matrix elements of $S_n$-Fourier coefficients compared to the best known classical Fast Fourier Transform (FFT) over the symmetric group. In particular, we utilize the Okounkov-Vershik approach to prove Harrow's statement (Ph.D. Thesis 2005 p.160) on the equivalence between $\operatorname{SU}(d)$- and $S_n$-irrep bases and to establish the $S_n$-equivariant Convolutional Quantum Alternating Ans\"atze ($S_n$-CQA) using Young-Jucys-Murphy (YJM) elements. We prove that $S_n$-CQA are dense, thus expressible within each $S_n$-irrep block, which may serve as a universal model for potential future quantum machine learning and optimization applications. Our method provides another way to prove the universality of Quantum Approximate Optimization Algorithm (QAOA), from the representation-theoretical point of view. Our framework can be naturally applied to a wide array of problems with global $\operatorname{SU}(d)$ symmetry. We present numerical simulations to showcase the effectiveness of the ans\"atze to find the sign structure of the ground state of the $J_1$-$J_2$ antiferromagnetic Heisenberg model on the rectangular and Kagome lattices. Our work identifies quantum advantage for a specific machine learning problem, and provides the first application of the celebrated Okounkov-Vershik's representation theory to machine learning and quantum physics.
    Approximated Multi-Agent Fitted Q Iteration. (arXiv:2104.09343v3 [cs.LG] UPDATED)
    (2 min) We formulate an efficient approximation for multi-agent batch reinforcement learning, the approximated multi-agent fitted Q iteration (AMAFQI). We present a detailed derivation of our approach. We propose an iterative policy search and show that it yields a greedy policy with respect to multiple approximations of the centralized, learned Q-function. In each iteration and policy evaluation, AMAFQI requires a number of computations that scales linearly with the number of agents whereas the analogous number of computations increase exponentially for the fitted Q iteration (FQI), a commonly used approaches in batch reinforcement learning. This property of AMAFQI is fundamental for the design of a tractable multi-agent approach. We evaluate the performance of AMAFQI and compare it to FQI in numerical simulations. The simulations illustrate the significant computation time reduction when using AMAFQI instead of FQI in multi-agent problems and corroborate the similar performance of both approaches.
    ASM2TV: An Adaptive Semi-Supervised Multi-Task Multi-View Learning Framework for Human Activity Recognition. (arXiv:2105.08643v2 [cs.LG] UPDATED)
    (2 min) Many real-world scenarios, such as human activity recognition (HAR) in IoT, can be formalized as a multi-task multi-view learning problem. Each specific task consists of multiple shared feature views collected from multiple sources, either homogeneous or heterogeneous. Common among recent approaches is to employ a typical hard/soft sharing strategy at the initial phase separately for each view across tasks to uncover common knowledge, underlying the assumption that all views are conditionally independent. On the one hand, multiple views across tasks possibly relate to each other under practical situations. On the other hand, supervised methods might be insufficient when labeled data is scarce. To tackle these challenges, we introduce a novel framework ASM2TV for semi-supervised multi-task multi-view learning. We present a new perspective named gating control policy, a learnable task-view-interacted sharing policy that adaptively selects the most desirable candidate shared block for any view across any task, which uncovers more fine-grained task-view-interacted relatedness and improves inference efficiency. Significantly, our proposed gathering consistency adaption procedure takes full advantage of large amounts of unlabeled fragmented time-series, making it a general framework that accommodates a wide range of applications. Experiments on two diverse real-world HAR benchmark datasets collected from various subjects and sources demonstrate our framework's superiority over other state-of-the-arts. The detailed codes are available at https://github.com/zachstarkk/ASM2TV.
    EasyFL: A Low-code Federated Learning Platform For Dummies. (arXiv:2105.07603v3 [cs.DC] UPDATED)
    (2 min) Academia and industry have developed several platforms to support the popular privacy-preserving distributed learning method -- Federated Learning (FL). However, these platforms are complex to use and require a deep understanding of FL, which imposes high barriers to entry for beginners, limits the productivity of researchers, and compromises deployment efficiency. In this paper, we propose the first low-code FL platform, EasyFL, to enable users with various levels of expertise to experiment and prototype FL applications with little coding. We achieve this goal while ensuring great flexibility and extensibility for customization by unifying simple API design, modular design, and granular training flow abstraction. With only a few lines of code, EasyFL empowers them with many out-of-the-box functionalities to accelerate experimentation and deployment. These practical functionalities are heterogeneity simulation, comprehensive tracking, distributed training optimization, and seamless deployment. They are proposed based on challenges identified in the proposed FL life cycle. Compared with other platforms, EasyFL not only requires just three lines of code (at least 10x lesser) to build a vanilla FL application but also incurs lower training overhead. Besides, our evaluations demonstrate that EasyFL expedites distributed training by 1.5x. It also improves the efficiency of deployment. We believe that EasyFL will increase the productivity of researchers and democratize FL to wider audiences.
    Survey on Causal-based Machine Learning Fairness Notions. (arXiv:2010.09553v4 [cs.LG] UPDATED)
    (2 min) Addressing the problem of fairness is crucial to safely use machine learning algorithms to support decisions with a critical impact on people's lives such as job hiring, child maltreatment, disease diagnosis, loan granting, etc. Several notions of fairness have been defined and examined in the past decade, such as, statistical parity and equalized odds. The most recent fairness notions, however, are causal-based and reflect the now widely accepted idea that using causality is necessary to appropriately address the problem of fairness. This paper examines an exhaustive list of causal-based fairness notions, in particular their applicability in real-world scenarios. As the majority of causal-based fairness notions are defined in terms of non-observable quantities (e.g. interventions and counterfactuals), their applicability depends heavily on the identifiability of those quantities from observational data. In this paper, we compile the most relevant identifiability criteria for the problem of fairness from the extensive literature on identifiability theory. These criteria are then used to decide about the applicability of causal-based fairness notions in concrete discrimination scenarios.
    Identifying critical nodes in complex networks by graph representation learning. (arXiv:2201.07988v1 [cs.SI])
    (2 min) Because of its wide application, critical nodes identification has become an important research topic at the micro level of network science. Influence maximization is one of the main problems in critical nodes mining and is usually handled with heuristics. In this paper, a deep graph learning framework IMGNN is proposed and the corresponding training sample generation scheme is designed. The framework takes centralities of nodes in a network as input and the probability that nodes in the optimal initial spreaders as output. By training on a large number of small synthetic networks, IMGNN is more efficient than human-based heuristics in minimizing the size of initial spreaders under the fixed infection scale. The experimental results on one synthetic and five real networks show that, compared with traditional non-iterative node ranking algorithms, IMGNN has the smallest proportion of initial spreaders under different infection probabilities when the final infection scale is fixed. And the reordered version of IMGNN outperforms all the latest critical nodes mining algorithms.
    Omnivore: A Single Model for Many Visual Modalities. (arXiv:2201.08377v1 [cs.CV])
    (2 min) Prior work has studied different visual modalities in isolation and developed separate architectures for recognition of images, videos, and 3D data. Instead, in this paper, we propose a single model which excels at classifying images, videos, and single-view 3D data using exactly the same model parameters. Our 'Omnivore' model leverages the flexibility of transformer-based architectures and is trained jointly on classification tasks from different modalities. Omnivore is simple to train, uses off-the-shelf standard datasets, and performs at-par or better than modality-specific models of the same size. A single Omnivore model obtains 86.0% on ImageNet, 84.1% on Kinetics, and 67.1% on SUN RGB-D. After finetuning, our models outperform prior work on a variety of vision tasks and generalize across modalities. Omnivore's shared visual representation naturally enables cross-modal recognition without access to correspondences between modalities. We hope our results motivate researchers to model visual modalities together.
    Exact Statistical Inference for the Wasserstein Distance by Selective Inference. (arXiv:2109.14206v3 [stat.ML] UPDATED)
    (2 min) In this paper, we study statistical inference for the Wasserstein distance, which has attracted much attention and has been applied to various machine learning tasks. Several studies have been proposed in the literature, but almost all of them are based on asymptotic approximation and do not have finite-sample validity. In this study, we propose an exact (non-asymptotic) inference method for the Wasserstein distance inspired by the concept of conditional Selective Inference (SI). To our knowledge, this is the first method that can provide a valid confidence interval (CI) for the Wasserstein distance with finite-sample coverage guarantee, which can be applied not only to one-dimensional problems but also to multi-dimensional problems. We evaluate the performance of the proposed method on both synthetic and real-world datasets.
    Integration of adaptive control and reinforcement learning approaches for real-time control and learning. (arXiv:2105.06577v3 [cs.LG] UPDATED)
    (2 min) This paper considers the problem of real-time control and learning in dynamic systems subjected to parametric uncertainties and proposes a controller that combines Adaptive Control (AC) in the inner loop and a Reinforcement Learning (RL) based policy in the outer loop. Two classes of nonlinear dynamic systems are considered, both of which are control-affine. The first class of dynamic systems utilizes equilibrium points with expansion forms around these points and employs a Lyapunov approach. The second class of nonlinear systems uses contraction theory as the underlying framework. For both classes of systems, the AC-RL controller is shown to lead to online policies that guarantee stability, and leverage accelerated convergence properties using a high-order tuner. Additionally, for the second class of systems, the AC-RL controller is shown to lead to parameter learning with persistent excitation. Numerical validations of all algorithms are carried out using a quadrotor landing task on a moving platform and other academic examples. All results clearly point out the advantage of the proposed integrative AC-RL approach.
    Stitch it in Time: GAN-Based Facial Editing of Real Videos. (arXiv:2201.08361v1 [cs.CV])
    (2 min) The ability of Generative Adversarial Networks to encode rich semantics within their latent space has been widely adopted for facial image editing. However, replicating their success with videos has proven challenging. Sets of high-quality facial videos are lacking, and working with videos introduces a fundamental barrier to overcome - temporal coherency. We propose that this barrier is largely artificial. The source video is already temporally coherent, and deviations from this state arise in part due to careless treatment of individual components in the editing pipeline. We leverage the natural alignment of StyleGAN and the tendency of neural networks to learn low frequency functions, and demonstrate that they provide a strongly consistent prior. We draw on these insights and propose a framework for semantic editing of faces in videos, demonstrating significant improvements over the current state-of-the-art. Our method produces meaningful face manipulations, maintains a higher degree of temporal consistency, and can be applied to challenging, high quality, talking head videos which current methods struggle with.
    EdgeMap: CrowdSourcing High Definition Map in Automotive Edge Computing. (arXiv:2201.07973v1 [cs.LG])
    (0 min) High definition (HD) map needs to be updated frequently to capture road changes, which is constrained by limited specialized collection vehicles. To maintain an up-to-date map, we explore crowdsourcing data from connected vehicles. Updating the map collaboratively is, however, challenging under constrained transmission and computation resources in dynamic networks. In this paper, we propose EdgeMap, a crowdsourcing HD map to minimize the usage of network resources while maintaining the latency requirements. We design a DATE algorithm to adaptively offload vehicular data on a small time scale and reserve network resources on a large time scale, by leveraging the multi-agent deep reinforcement learning and Gaussian process regression. We evaluate the performance of EdgeMap with extensive network simulations in a time-driven end-to-end simulator. The results show that EdgeMap reduces more than 30% resource usage as compared to state-of-the-art solutions.
    Node Feature Kernels Increase Graph Convolutional Network Robustness. (arXiv:2109.01785v2 [cs.LG] UPDATED)
    (0 min) The robustness of the much-used Graph Convolutional Networks (GCNs) to perturbations of their input is becoming a topic of increasing importance. In this paper, the random GCN is introduced for which a random matrix theory analysis is possible. This analysis suggests that if the graph is sufficiently perturbed, or in the extreme case random, then the GCN fails to benefit from the node features. It is furthermore observed that enhancing the message passing step in GCNs by adding the node feature kernel to the adjacency matrix of the graph structure solves this problem. An empirical study of a GCN utilised for node classification on six real datasets further confirms the theoretical findings and demonstrates that perturbations of the graph structure can result in GCNs performing significantly worse than Multi-Layer Perceptrons run on the node features alone. In practice, adding a node feature kernel to the message passing of perturbed graphs results in a significant improvement of the GCN's performance, thereby rendering it more robust to graph perturbations. Our code is publicly available at:https://github.com/ChangminWu/RobustGCN.
    Safety-AwareMulti-Agent Apprenticeship Learning. (arXiv:2201.08111v1 [cs.AI])
    (2 min) Our objective of this project is to make the extension based on the technique mentioned in the paper "Safety-Aware Apprenticeship Learning" to improve the utility and the efficiency of the existing Reinforcement Learning model from a Single-Agent Learning framework to a Multi-Agent Learning framework. Our contributions to the project are presented in the following bullet points: 1. Regarding the fact that we will add an extension to the Inverse Reinforcement Learning model from a Single-Agent scenario to a Multi-Agentscenario. Our first contribution to this project is considering the case of extracting safe reward functions from expert behaviors in a Multi-Agent scenario instead of being from the Single-Agent scenario. 2. Our second contribution is extending the Single-Agent Learning Framework to a Multi-Agent Learning framework and designing a novel Learning Framework based on the extension in the end. 3. Our final contribution to this project is evaluating empirically the performance of my extension to the Single-Agent Inverse Reinforcement Learning framework.
    Sentiment Analysis: Predicting Yelp Scores. (arXiv:2201.07999v1 [cs.LG])
    (0 min) In this work, we predict the sentiment of restaurant reviews based on a subset of the Yelp Open Dataset. We utilize the meta features and text available in the dataset and evaluate several machine learning and state-of-the-art deep learning approaches for the prediction task. Through several qualitative experiments, we show the success of the deep models with attention mechanism in learning a balanced model for reviews across different restaurants. Finally, we propose a novel Multi-tasked joint BERT model that improves the overall classification performance.
    Statistical Depth Functions for Ranking Distributions: Definitions, Statistical Learning and Applications. (arXiv:2201.08105v1 [cs.LG])
    (2 min) The concept of median/consensus has been widely investigated in order to provide a statistical summary of ranking data, i.e. realizations of a random permutation $\Sigma$ of a finite set, $\{1,\; \ldots,\; n\}$ with $n\geq 1$ say. As it sheds light onto only one aspect of $\Sigma$'s distribution $P$, it may neglect other informative features. It is the purpose of this paper to define analogs of quantiles, ranks and statistical procedures based on such quantities for the analysis of ranking data by means of a metric-based notion of depth function on the symmetric group. Overcoming the absence of vector space structure on $\mathfrak{S}_n$, the latter defines a center-outward ordering of the permutations in the support of $P$ and extends the classic metric-based formulation of consensus ranking (medians corresponding then to the deepest permutations). The axiomatic properties that ranking depths should ideally possess are listed, while computational and generalization issues are studied at length. Beyond the theoretical analysis carried out, the relevance of the novel concepts and methods introduced for a wide variety of statistical tasks are also supported by numerous numerical experiments.
    Self-Awareness Safety of Deep Reinforcement Learning in Road Traffic Junction Driving. (arXiv:2201.08116v1 [cs.AI])
    (0 min) Autonomous driving has been at the forefront of public interest, and a pivotal debate to widespread concerns is safety in the transportation system. Deep reinforcement learning (DRL) has been applied to autonomous driving to provide solutions for obstacle avoidance. However, in a road traffic junction scenario, the vehicle typically receives partial observations from the transportation environment, while DRL needs to rely on long-term rewards to train a reliable model by maximising the cumulative rewards, which may take the risk when exploring new actions and returning either a positive reward or a penalty in the case of collisions. Although safety concerns are usually considered in the design of a reward function, they are not fully considered as the critical metric to directly evaluate the effectiveness of DRL algorithms in autonomous driving. In this study, we evaluated the safety performance of three baseline DRL models (DQN, A2C, and PPO) and proposed a self-awareness module from an attention mechanism for DRL to improve the safety evaluation for an anomalous vehicle in a complex road traffic junction environment, such as intersection and roundabout scenarios, based on four metrics: collision rate, success rate, freezing rate, and total reward. Our two experimental results in the training and testing phases revealed the baseline DRL with poor safety performance, while our proposed self-awareness attention-DQN can significantly improve the safety performance in intersection and roundabout scenarios.
    Learning Estimates At The Edge Using Intermittent And Aged Measurement Updates. (arXiv:2201.08020v1 [cs.LG])
    (0 min) Cyber Physical Systems (CPS) applications have agents that actuate in their local vicinity, while requiring measurements that capture the state of their larger environment to make actuation choices. These measurements are made by sensors and communicated over a network as update packets. Network resource constraints dictate that updates arrive at an agent intermittently and be aged on their arrival. This can be alleviated by providing an agent with a fast enough rate of estimates of the measurements. Often works on estimation assume knowledge of the dynamic model of the system being measured. However, as CPS applications become pervasive, such information may not be available in practice. In this work, we propose a novel deep neural network architecture that leverages Long Short Term Memory (LSTM) networks to learn estimates in a model-free setting using only updates received over the network. We detail an online algorithm that enables training of our architecture. The architecture is shown to provide good estimates of measurements of both a linear and a non-linear dynamic system. It learns good estimates even when the learning proceeds over a generic network setting in which the distributions that govern the rate and age of received measurements may change significantly over time. We demonstrate the efficacy of the architecture by comparing it with the baselines of the Time-varying Kalman Filter and the Unscented Kalman Filter. The architecture enables empirical insights with regards to maintaining the ages of updates at the estimator, which are used by it and also the baselines.
    A dynamic programming algorithm for finding an optimal sequence of informative measurements. (arXiv:2109.11808v3 [cs.LG] UPDATED)
    (2 min) An informative measurement is the most efficient way to gain information about an unknown state. We give a first-principles derivation of a general-purpose dynamic programming algorithm that returns an optimal sequence of informative measurements by sequentially maximizing the entropy of possible measurement outcomes. This algorithm can be used by an autonomous agent or robot to decide where best to measure next, planning a path corresponding to an optimal sequence of informative measurements. The algorithm is applicable to states and controls that are continuous or discrete, and agent dynamics that is either stochastic or deterministic; including Markov decision processes and Gaussian processes. Recent results from approximate dynamic programming and reinforcement learning, including on-line approximations such as rollout and Monte Carlo tree search, allow the measurement task to be solved in real-time. The resulting solutions include non-myopic paths and measurement sequences that can generally outperform, sometimes substantially, commonly used greedy approaches. This is demonstrated for a global search problem, where on-line planning with an extended local search is found to reduce the number of measurements in the search by approximately half. A variant of the algorithm is derived for Gaussian processes for active sensing.
    Sketch-and-Lift: Scalable Subsampled Semidefinite Program for $K$-means Clustering. (arXiv:2201.08226v1 [stat.ML])
    (0 min) Semidefinite programming (SDP) is a powerful tool for tackling a wide range of computationally hard problems such as clustering. Despite the high accuracy, semidefinite programs are often too slow in practice with poor scalability on large (or even moderate) datasets. In this paper, we introduce a linear time complexity algorithm for approximating an SDP relaxed $K$-means clustering. The proposed sketch-and-lift (SL) approach solves an SDP on a subsampled dataset and then propagates the solution to all data points by a nearest-centroid rounding procedure. It is shown that the SL approach enjoys a similar exact recovery threshold as the $K$-means SDP on the full dataset, which is known to be information-theoretically tight under the Gaussian mixture model. The SL method can be made adaptive with enhanced theoretic properties when the cluster sizes are unbalanced. Our simulation experiments demonstrate that the statistical accuracy of the proposed method outperforms state-of-the-art fast clustering algorithms without sacrificing too much computational efficiency, and is comparable to the original $K$-means SDP with substantially reduced runtime.
    Modeling Baroque Two-Part Counterpoint with Neural Machine Translation. (arXiv:2006.14221v4 [cs.SD] UPDATED)
    (0 min) We propose a system for contrapuntal music generation based on a Neural Machine Translation (NMT) paradigm. We consider Baroque counterpoint and are interested in modeling the interaction between any two given parts as a mapping between a given source material and an appropriate target material. Like in translation, the former imposes some constraints on the latter, but doesn't define it completely. We collate and edit a bespoke dataset of Baroque pieces, use it to train an attention-based neural network model, and evaluate the generated output via BLEU score and musicological analysis. We show that our model is able to respond with some idiomatic trademarks, such as imitation and appropriate rhythmic offset, although it falls short of having learned stylistically correct contrapuntal motion (e.g., avoidance of parallel fifths) or stricter imitative rules, such as canon.
    Exact learning for infinite families of concepts. (arXiv:2201.08225v1 [cs.AI])
    (0 min) In this paper, based on results of exact learning, test theory, and rough set theory, we study arbitrary infinite families of concepts each of which consists of an infinite set of elements and an infinite set of subsets of this set called concepts. We consider the notion of a problem over a family of concepts that is described by a finite number of elements: for a given concept, we should recognize which of the elements under consideration belong to this concept. As algorithms for problem solving, we consider decision trees of five types: (i) using membership queries, (ii) using equivalence queries, (iii) using both membership and equivalence queries, (iv) using proper equivalence queries, and (v) using both membership and proper equivalence queries. As time complexity, we study the depth of decision trees. In the worst case, with the growth of the number of elements in the problem description, the minimum depth of decision trees of the first type either grows as a logarithm or linearly, and the minimum depth of decision trees of each of the other types either is bounded from above by a constant or grows as a logarithm, or linearly. The obtained results allow us to distinguish seven complexity classes of infinite families of concepts.
    Sample Summary with Generative Encoding. (arXiv:2201.08233v1 [cs.LG])
    (0 min) With increasing sample sizes, all algorithms require longer run times that scales at best logarithmically. A concept that summarises the sample space to reduce the total number of samples into a core set that can be used for regression tasks is introduced. This idea of summarisation is called folding - the technique for projecting data into a lower dimensional subspace, whereas unfolding projects it back into the original space. Results for a prediction task show that information is retained during folding as accuracy after unfolding is still comparable to prediction without summarisation.
    Efficient PAC Learning from the Crowd with Pairwise Comparison. (arXiv:2011.01104v3 [cs.LG] UPDATED)
    (0 min) Efficient PAC learning of threshold functions is arguably one of the most important problems in machine learning. With the unprecedented growth of large-scale data sets, it has become ubiquitous to appeal to the crowd wisdom for data annotation, and the central problem that attracts a surge of recent interests is how one can learn the underlying hypothesis from the highly noisy crowd annotation while well-controlling the annotation cost. On the other hand, a large body of recent works have investigated the problem of learning with not only labels, but also pairwise comparisons, since in many applications it is easier to compare than to label. In this paper, we study the problem of PAC learning threshold functions from the crowd, where the annotators can provide (noisy) labels or pairwise comparison tags. We design a label-efficient algorithm that interleaves learning and annotation, which leads to a constant overhead of our algorithm (a notion that characterizes the query complexity). In contrast, a natural approach of annotation followed by learning leads to an overhead growing with the sample size.
    Generalizing Off-Policy Evaluation From a Causal Perspective For Sequential Decision-Making. (arXiv:2201.08262v1 [cs.LG])
    (2 min) Assessing the effects of a policy based on observational data from a different policy is a common problem across several high-stake decision-making domains, and several off-policy evaluation (OPE) techniques have been proposed. However, these methods largely formulate OPE as a problem disassociated from the process used to generate the data (i.e. structural assumptions in the form of a causal graph). We argue that explicitly highlighting this association has important implications on our understanding of the fundamental limits of OPE. First, this implies that current formulation of OPE corresponds to a narrow set of tasks, i.e. a specific causal estimand which is focused on prospective evaluation of policies over populations or sub-populations. Second, we demonstrate how this association motivates natural desiderata to consider a general set of causal estimands, particularly extending the role of OPE for counterfactual off-policy evaluation at the level of individuals of the population. A precise description of the causal estimand highlights which OPE estimands are identifiable from observational data under the stated generative assumptions. For those OPE estimands that are not identifiable, the causal perspective further highlights where more experimental data is necessary, and highlights situations where human expertise can aid identification and estimation. Furthermore, many formalisms of OPE overlook the role of uncertainty entirely in the estimation process.We demonstrate how specifically characterising the causal estimand highlights the different sources of uncertainty and when human expertise can naturally manage this uncertainty. We discuss each of these aspects as actionable desiderata for future OPE research at scale and in-line with practical utility.
    Batch Normalization Preconditioning for Neural Network Training. (arXiv:2108.01110v2 [cs.LG] UPDATED)
    (2 min) Batch normalization (BN) is a popular and ubiquitous method in deep learning that has been shown to decrease training time and improve generalization performance of neural networks. Despite its success, BN is not theoretically well understood. It is not suitable for use with very small mini-batch sizes or online learning. In this paper, we propose a new method called Batch Normalization Preconditioning (BNP). Instead of applying normalization explicitly through a batch normalization layer as is done in BN, BNP applies normalization by conditioning the parameter gradients directly during training. This is designed to improve the Hessian matrix of the loss function and hence convergence during training. One benefit is that BNP is not constrained on the mini-batch size and works in the online learning setting. Furthermore, its connection to BN provides theoretical insights on how BN improves training and how BN is applied to special architectures such as convolutional neural networks. For a theoretical foundation, we also present a novel Hessian condition number based convergence theory for a locally convex but not strong-convex loss, which is applicable to networks with a scale-invariant property.
    NeurWIN: Neural Whittle Index Network For Restless Bandits Via Deep RL. (arXiv:2110.02128v2 [cs.LG] UPDATED)
    (2 min) Whittle index policy is a powerful tool to obtain asymptotically optimal solutions for the notoriously intractable problem of restless bandits. However, finding the Whittle indices remains a difficult problem for many practical restless bandits with convoluted transition kernels. This paper proposes NeurWIN, a neural Whittle index network that seeks to learn the Whittle indices for any restless bandits by leveraging mathematical properties of the Whittle indices. We show that a neural network that produces the Whittle index is also one that produces the optimal control for a set of Markov decision problems. This property motivates using deep reinforcement learning for the training of NeurWIN. We demonstrate the utility of NeurWIN by evaluating its performance for three recently studied restless bandit problems. Our experiment results show that the performance of NeurWIN is significantly better than other RL algorithms.
    Learning-based Hybrid Local Search for the Hard-label Textual Attack. (arXiv:2201.08193v1 [cs.CL])
    (2 min) Deep neural networks are vulnerable to adversarial examples in Natural Language Processing. However, existing textual adversarial attacks usually utilize the gradient or prediction confidence to generate adversarial examples, making it hard to be deployed in real-world applications. To this end, we consider a rarely investigated but more rigorous setting, namely hard-label attack, in which the attacker could only access the prediction label. In particular, we find that the changes on prediction label caused by word substitutions on the adversarial example could precisely reflect the importance of different words. Based on this observation, we propose a novel hard-label attack, called Learning-based Hybrid Local Search (LHLS) algorithm, which effectively estimates word importance with the prediction label from the attack history and integrate such information into hybrid local search algorithm to optimize the adversarial perturbation. Extensive evaluations for text classification and textual entailment using various datasets and models show that our LHLS significantly outperforms existing hard-label attacks regarding the attack performance as well as adversary quality.
    Transfer Learning for Fault Diagnosis of Transmission Lines. (arXiv:2201.08018v1 [cs.LG])
    (2 min) Recent artificial intelligence-based methods have shown great promise in the use of neural networks for real-time sensing and detection of transmission line faults and estimation of their locations. The expansion of power systems including transmission lines with various lengths have made a fault detection, classification, and location estimation process more challenging. Transmission line datasets are stream data which are continuously collected by various sensors and hence, require generalized and fast fault diagnosis approaches. Newly collected datasets including voltages and currents might not have enough and accurate labels (fault and no fault) that are useful to train neural networks. In this paper, a novel transfer learning framework based on a pre-trained LeNet-5 convolutional neural network is proposed. This method is able to diagnose faults for different transmission line lengths and impedances by transferring the knowledge from a source convolutional neural network to predict a dissimilar target dataset. By transferring this knowledge, faults from various transmission lines, without having enough labels, can be diagnosed faster and more efficiently compared to the existing methods. To prove the feasibility and effectiveness of this methodology, seven different datasets that include various lengths of transmission lines are used. The robustness of the proposed methodology against generator voltage fluctuation, variation in fault distance, fault inception angle, fault resistance, and phase difference between the two generators are well shown, thus proving its practical values in the fault diagnosis of transmission lines.
    Lifelong Learning Metrics. (arXiv:2201.08278v1 [cs.AI])
    (2 min) The DARPA Lifelong Learning Machines (L2M) program seeks to yield advances in artificial intelligence (AI) systems so that they are capable of learning (and improving) continuously, leveraging data on one task to improve performance on another, and doing so in a computationally sustainable way. Performers on this program developed systems capable of performing a diverse range of functions, including autonomous driving, real-time strategy, and drone simulation. These systems featured a diverse range of characteristics (e.g., task structure, lifetime duration), and an immediate challenge faced by the program's testing and evaluation team was measuring system performance across these different settings. This document, developed in close collaboration with DARPA and the program performers, outlines a formalism for constructing and characterizing the performance of agents performing lifelong learning scenarios.
    Fairness in Federated Learning for Spatial-Temporal Applications. (arXiv:2201.06598v2 [cs.LG] UPDATED)
    (2 min) Federated learning involves training statistical models over remote devices such as mobile phones while keeping data localized. Training in heterogeneous and potentially massive networks introduces opportunities for privacy-preserving data analysis and diversifying these models to become more inclusive of the population. Federated learning can be viewed as a unique opportunity to bring fairness and parity to many existing models by enabling model training to happen on a diverse set of participants and on data that is generated regularly and dynamically. In this paper, we discuss the current metrics and approaches that are available to measure and evaluate fairness in the context of spatial-temporal models. We propose how these metrics and approaches can be re-defined to address the challenges that are faced in the federated learning setting.
    OpenIPDM: A Probabilistic Framework for Estimating the Deterioration and Effect of Interventions on Bridges. (arXiv:2201.08254v1 [eess.SY])
    (0 min) This paper describes OpenIPDM software for modelling the deterioration process of infrastructures using network-scale visual inspection data. In addition to the deterioration state estimates, OpenIPDM provides functions for quantifying the effect of interventions, estimating the service life of an intervention, and generating synthetic data for verification purposes. Each of the aforementioned functions are accessible by an interactive graphical user interface. OpenIPDM is designed based on the research work done on a network of bridges in Quebec province, so that the concepts presented in the software have been validated for applications in a real-world context. In addition, this software provides foundations for future developments in the subject area of modelling the deterioration as well as intervention planning.
    Cognitive Ledger Project: Towards Building Personal Digital Twins Through Cognitive Blockchain. (arXiv:2201.08163v1 [cs.AI])
    (2 min) The Cognitive Ledger Project is an effort to develop a modular system for turning users' personal data into structured information and machine learning models based on a blockchain-based infrastructure. In this work-in-progress paper, we propose a cognitive architecture for cognitive digital twins. The suggested design embraces a cognitive blockchain (Cognitive ledger) at its core. The architecture includes several modules that turn users' activities in the digital environment into reusable knowledge objects and artificial intelligence that one day can work together to form the cognitive digital twin of users.
    Mesh convolutional neural networks for wall shear stress estimation in 3D artery models. (arXiv:2109.04797v3 [cs.LG] UPDATED)
    (0 min) Computational fluid dynamics (CFD) is a valuable tool for personalised, non-invasive evaluation of hemodynamics in arteries, but its complexity and time-consuming nature prohibit large-scale use in practice. Recently, the use of deep learning for rapid estimation of CFD parameters like wall shear stress (WSS) on surface meshes has been investigated. However, existing approaches typically depend on a hand-crafted re-parametrisation of the surface mesh to match convolutional neural network architectures. In this work, we propose to instead use mesh convolutional neural networks that directly operate on the same finite-element surface mesh as used in CFD. We train and evaluate our method on two datasets of synthetic coronary artery models with and without bifurcation, using a ground truth obtained from CFD simulation. We show that our flexible deep learning model can accurately predict 3D WSS vectors on this surface mesh. Our method processes new meshes in less than 5 [s], consistently achieves a normalised mean absolute error of $\leq$ 1.6 [%], and peaks at 90.5 [%] median approximation accuracy over the held-out test set, comparing favourably to previously published work. This demonstrates the feasibility of CFD surrogate modelling using mesh convolutional neural networks for hemodynamic parameter estimation in artery models.
    Accelerated Gradient Flow: Risk, Stability, and Implicit Regularization. (arXiv:2201.08311v1 [stat.ML])
    (0 min) Acceleration and momentum are the de facto standard in modern applications of machine learning and optimization, yet the bulk of the work on implicit regularization focuses instead on unaccelerated methods. In this paper, we study the statistical risk of the iterates generated by Nesterov's accelerated gradient method and Polyak's heavy ball method, when applied to least squares regression, drawing several connections to explicit penalization. We carry out our analyses in continuous-time, allowing us to make sharper statements than in prior work, and revealing complex interactions between early stopping, stability, and the curvature of the loss function.
    On Regularizability and its Application to Online Control of Unstable LTI Systems. (arXiv:2006.00125v3 [eess.SY] UPDATED)
    (0 min) Learning, say through direct policy updates, often requires assumptions such as knowing a priori that the initial policy (gain) is stabilizing, or persistently exciting (PE) input-output data, is available. In this paper, we examine online regulation of (possibly unstable) partially unknown linear systems with no prior access to an initial stabilizing controller nor PE input-output data; we instead leverage the knowledge of the input matrix for online regulation. First, we introduce and characterize the notion of "regularizability" for linear systems that gauges the extent by which a system can be regulated in finite-time in contrast to its asymptotic behavior (commonly characterized by stabilizability/controllability). Next, having access only to the input matrix, we propose the Data-Guided Regulation (DGR) synthesis procedure that -- as its name suggests -- regulates the underlying state while also generating informative data that can subsequently be used for data-driven stabilization or system identification. We further improve the computational performance of DGR via a rank-one update and demonstrate its utility in online regulation of the X-29 aircraft.
    HIST: A Graph-based Framework for Stock Trend Forecasting via Mining Concept-Oriented Shared Information. (arXiv:2110.13716v2 [q-fin.ST] UPDATED)
    (2 min) Stock trend forecasting, which forecasts stock prices' future trends, plays an essential role in investment. The stocks in a market can share information so that their stock prices are highly correlated. Several methods were recently proposed to mine the shared information through stock concepts (e.g., technology, Internet Retail) extracted from the Web to improve the forecasting results. However, previous work assumes the connections between stocks and concepts are stationary, and neglects the dynamic relevance between stocks and concepts, limiting the forecasting results. Moreover, existing methods overlook the invaluable shared information carried by hidden concepts, which measure stocks' commonness beyond the manually defined stock concepts. To overcome the shortcomings of previous work, we proposed a novel stock trend forecasting framework that can adequately mine the concept-oriented shared information from predefined concepts and hidden concepts. The proposed framework simultaneously utilize the stock's shared information and individual information to improve the stock trend forecasting performance. Experimental results on the real-world tasks demonstrate the efficiency of our framework on stock trend forecasting. The investment simulation shows that our framework can achieve a higher investment return than the baselines.
    Predictive Inference with Weak Supervision. (arXiv:2201.08315v1 [stat.ML])
    (2 min) The expense of acquiring labels in large-scale statistical machine learning makes partially and weakly-labeled data attractive, though it is not always apparent how to leverage such data for model fitting or validation. We present a methodology to bridge the gap between partial supervision and validation, developing a conformal prediction framework to provide valid predictive confidence sets -- sets that cover a true label with a prescribed probability, independent of the underlying distribution -- using weakly labeled data. To do so, we introduce a (necessary) new notion of coverage and predictive validity, then develop several application scenarios, providing efficient algorithms for classification and several large-scale structured prediction problems. We corroborate the hypothesis that the new coverage definition allows for tighter and more informative (but valid) confidence sets through several experiments.
    Minimax Demographic Group Fairness in Federated Learning. (arXiv:2201.08304v1 [cs.LG])
    (2 min) Federated learning is an increasingly popular paradigm that enables a large number of entities to collaboratively learn better models. In this work, we study minimax group fairness in federated learning scenarios where different participating entities may only have access to a subset of the population groups during the training phase. We formally analyze how our proposed group fairness objective differs from existing federated learning fairness criteria that impose similar performance across participants instead of demographic groups. We provide an optimization algorithm -- FedMinMax -- for solving the proposed problem that provably enjoys the performance guarantees of centralized learning algorithms. We experimentally compare the proposed approach against other state-of-the-art methods in terms of group fairness in various federated learning setups, showing that our approach exhibits competitive or superior performance.
    Factor-augmented tree ensembles. (arXiv:2111.14000v2 [stat.ML] UPDATED)
    (2 min) This manuscript proposes to extend the information set of time-series regression trees with latent stationary factors extracted via state-space methods. In doing so, this approach generalises time-series regression trees on two dimensions. First, it allows to handle predictors that exhibit measurement error, non-stationary trends, seasonality and/or irregularities such as missing observations. Second, it gives a transparent way for using domain-specific theory to inform time-series regression trees. As a byproduct, this technique sets the foundations for structuring powerful ensembles. Their real-world applicability is studied under the lenses of empirical macro-finance.
    Understanding Agent Incentives using Causal Influence Diagrams. Part I: Single Action Settings. (arXiv:1902.09980v7 [cs.AI] UPDATED)
    (2 min) Agents are systems that optimize an objective function in an environment. Together, the goal and the environment induce secondary objectives, incentives. Modeling the agent-environment interaction using causal influence diagrams, we can answer two fundamental questions about an agent's incentives directly from the graph: (1) which nodes can the agent have an incentivize to observe, and (2) which nodes can the agent have an incentivize to control? The answers tell us which information and influence points need extra protection. For example, we may want a classifier for job applications to not use the ethnicity of the candidate, and a reinforcement learning agent not to take direct control of its reward mechanism. Different algorithms and training paradigms can lead to different causal influence diagrams, so our method can be used to identify algorithms with problematic incentives and help in designing algorithms with better incentives.
    MOHAQ: Multi-Objective Hardware-Aware Quantization of Recurrent Neural Networks. (arXiv:2108.01192v3 [cs.LG] UPDATED)
    (2 min) The compression of deep learning models is of fundamental importance in deploying such models to edge devices. The selection of compression parameters can be automated to meet changes in the hardware platform and application using optimization algorithms. This article introduces a Multi-Objective Hardware-Aware Quantization (MOHAQ) method, which considers hardware efficiency and inference error as objectives for mixed-precision quantization. The proposed method feasibly evaluates candidate solutions in a large search space by relying on two steps. First, post-training quantization is applied for fast solution evaluation (inference-only search). Second, we propose the "beacon-based search" to retrain selected solutions only and use them as beacons to know the effect of retraining on other solutions. We use a speech recognition model based on Simple Recurrent Unit (SRU) using the TIMIT dataset and apply our method to run on SiLago and Bitfusion platforms. We provide experimental evaluations showing that SRU can be compressed up to 8x by post-training quantization without any significant error increase. On SiLago, we found solutions that achieve 97\% and 86\% of the maximum possible speedup and energy saving, with a minor increase in error. On Bitfusion, beacon-based search reduced the error gain of inference-only search by up to 4.9 percentage points.
    Priors, Hierarchy, and Information Asymmetry for Skill Transfer in Reinforcement Learning. (arXiv:2201.08115v1 [cs.AI])
    (2 min) The ability to discover behaviours from past experience and transfer them to new tasks is a hallmark of intelligent agents acting sample-efficiently in the real world. Equipping embodied reinforcement learners with the same ability may be crucial for their successful deployment in robotics. While hierarchical and KL-regularized RL individually hold promise here, arguably a hybrid approach could combine their respective benefits. Key to these fields is the use of information asymmetry to bias which skills are learnt. While asymmetric choice has a large influence on transferability, prior works have explored a narrow range of asymmetries, primarily motivated by intuition. In this paper, we theoretically and empirically show the crucial trade-off, controlled by information asymmetry, between the expressivity and transferability of skills across sequential tasks. Given this insight, we provide a principled approach towards choosing asymmetry and apply our approach to a complex, robotic block stacking domain, unsolvable by baselines, demonstrating the effectiveness of hierarchical KL-regularized RL, coupled with correct asymmetric choice, for sample-efficient transfer learning.
    Meta Learning for Code Summarization. (arXiv:2201.08310v1 [cs.LG])
    (2 min) Source code summarization is the task of generating a high-level natural language description for a segment of programming language code. Current neural models for the task differ in their architecture and the aspects of code they consider. In this paper, we show that three SOTA models for code summarization work well on largely disjoint subsets of a large code-base. This complementarity motivates model combination: We propose three meta-models that select the best candidate summary for a given code segment. The two neural models improve significantly over the performance of the best individual model, obtaining an improvement of 2.1 BLEU points on a dataset of code segments where at least one of the individual models obtains a non-zero BLEU.
    Multi-agent Natural Actor-critic Reinforcement Learning Algorithms. (arXiv:2109.01654v2 [cs.LG] UPDATED)
    (3 min) Multi-agent actor-critic algorithms are an important part of the Reinforcement Learning paradigm. We propose three fully decentralized multi-agent natural actor-critic (MAN) algorithms in this work. The objective is to collectively find a joint policy that maximizes the average long-term return of these agents. In the absence of a central controller and to preserve privacy, agents communicate some information to their neighbors via a time-varying communication network. We prove convergence of all the 3 MAN algorithms to a globally asymptotically stable set of the ODE corresponding to actor update; these use linear function approximations. We show that the Kullback-Leibler divergence between policies of successive iterates is proportional to the objective function's gradient. We observe that the minimum singular value of the Fisher information matrix is well within the reciprocal of the policy parameter dimension. Using this, we theoretically show that the optimal value of the deterministic variant of the MAN algorithm at each iterate dominates that of the standard gradient-based multi-agent actor-critic (MAAC) algorithm. To our knowledge, it is a first such result in multi-agent reinforcement learning (MARL). To illustrate the usefulness of our proposed algorithms, we implement them on a bi-lane traffic network to reduce the average network congestion. We observe an almost 25\% reduction in the average congestion in 2 MAN algorithms; the average congestion in another MAN algorithm is on par with the MAAC algorithm. We also consider a generic $15$ agent MARL; the performance of the MAN algorithms is again as good as the MAAC algorithm.
    Physics-informed neural networks for modeling rate- and temperature-dependent plasticity. (arXiv:2201.08363v1 [cond-mat.mtrl-sci])
    (2 min) This work presents a physics-informed neural network based framework to model the strain-rate and temperature dependence of the deformation fields (displacement, stress, plastic strain) in elastic-viscoplastic solids. A detailed discussion on the construction of the physics-based loss criterion along with a brief outline on ways to avoid unbalanced back-propagated gradients during training is also presented. We also present a simple strategy with no added computational complexity for choosing scalar weights that balance the interplay between different terms in the composite loss. Moreover, we also highlight a fundamental challenge involving selection of appropriate model outputs so that the mechanical problem can be faithfully solved using neural networks. Finally, the effectiveness of the proposed framework is demonstrated by studying two test problems modeling the elastic-viscoplastic deformation in solids at different strain-rates and temperatures, respectively.
    Analyzing Data Selection Techniques with Tools from the Theory of Information Losses. (arXiv:1902.09602v4 [cs.LG] UPDATED)
    (2 min) In this paper, we present and illustrate some new tools for rigorously analyzing training data selection methods. These tools focus on the information theoretic losses that occur when sampling data. We use this framework to prove that two methods, Facility Location Selection and Transductive Experimental Design, reduce these losses. These are meant to act as generalizable theoretical examples of applying the field of Information Theoretic Deep Learning Theory to the fields of data selection and active learning. Both analyses yield insight into their respective methods and increase their interpretability. In the case of Transductive Experimental Design, the provided analysis greatly increases the method's scope as well.
    Survey on Federated Learning Threats: concepts, taxonomy on attacks and defences, experimental study and challenges. (arXiv:2201.08135v1 [cs.CR])
    (2 min) Federated learning is a machine learning paradigm that emerges as a solution to the privacy-preservation demands in artificial intelligence. As machine learning, federated learning is threatened by adversarial attacks against the integrity of the learning model and the privacy of data via a distributed approach to tackle local and global learning. This weak point is exacerbated by the inaccessibility of data in federated learning, which makes harder the protection against adversarial attacks and evidences the need to furtherance the research on defence methods to make federated learning a real solution for safeguarding data privacy. In this paper, we present an extensive review of the threats of federated learning, as well as as their corresponding countermeasures, attacks versus defences. This survey provides a taxonomy of adversarial attacks and a taxonomy of defence methods that depict a general picture of this vulnerability of federated learning and how to overcome it. Likewise, we expound guidelines for selecting the most adequate defence method according to the category of the adversarial attack. Besides, we carry out an extensive experimental study from which we draw further conclusions about the behaviour of attacks and defences and the guidelines for selecting the most adequate defence method according to the category of the adversarial attack. This study is finished leading to meditated learned lessons and challenges.
    Technologies for Trustworthy Machine Learning: A Survey in a Socio-Technical Context. (arXiv:2007.08911v3 [cs.LG] UPDATED)
    (2 min) Concerns about the societal impact of AI-based services and systems has encouraged governments and other organisations around the world to propose AI policy frameworks to address fairness, accountability, transparency and related topics. To achieve the objectives of these frameworks, the data and software engineers who build machine-learning systems require knowledge about a variety of relevant supporting tools and techniques. In this paper we provide an overview of technologies that support building trustworthy machine learning systems, i.e., systems whose properties justify that people place trust in them. We argue that four categories of system properties are instrumental in achieving the policy objectives, namely fairness, explainability, auditability and safety & security (FEAS). We discuss how these properties need to be considered across all stages of the machine learning life cycle, from data collection through run-time model inference. As a consequence, we survey in this paper the main technologies with respect to all four of the FEAS properties, for data-centric as well as model-centric stages of the machine learning system life cycle. We conclude with an identification of open research problems, with a particular focus on the connection between trustworthy machine learning technologies and their implications for individuals and society.
    Cluster-based Input Weight Initialization for Echo State Networks. (arXiv:2103.04710v3 [cs.LG] UPDATED)
    (2 min) Echo State Networks (ESNs) are a special type of recurrent neural networks (RNNs), in which the input and recurrent connections are traditionally generated randomly, and only the output weights are trained. Despite the recent success of ESNs in various tasks of audio, image and radar recognition, we postulate that a purely random initialization is not the ideal way of initializing ESNs. The aim of this work is to propose an unsupervised initialization of the input connections using the $K$-Means algorithm on the training data. We show that for a large variety of datasets this initialization performs equivalently or superior than a randomly initialized ESN whilst needing significantly less reservoir neurons. Furthermore, we discuss that this approach provides the opportunity to estimate a suitable size of the reservoir based on prior knowledge about the data.
    Adaptive neighborhood Metric learning. (arXiv:2201.08314v1 [cs.LG])
    (2 min) In this paper, we reveal that metric learning would suffer from serious inseparable problem if without informative sample mining. Since the inseparable samples are often mixed with hard samples, current informative sample mining strategies used to deal with inseparable problem may bring up some side-effects, such as instability of objective function, etc. To alleviate this problem, we propose a novel distance metric learning algorithm, named adaptive neighborhood metric learning (ANML). In ANML, we design two thresholds to adaptively identify the inseparable similar and dissimilar samples in the training procedure, thus inseparable sample removing and metric parameter learning are implemented in the same procedure. Due to the non-continuity of the proposed ANML, we develop an ingenious function, named \emph{log-exp mean function} to construct a continuous formulation to surrogate it, which can be efficiently solved by the gradient descent method. Similar to Triplet loss, ANML can be used to learn both the linear and deep embeddings. By analyzing the proposed method, we find it has some interesting properties. For example, when ANML is used to learn the linear embedding, current famous metric learning algorithms such as the large margin nearest neighbor (LMNN) and neighbourhood components analysis (NCA) are the special cases of the proposed ANML by setting the parameters different values. When it is used to learn deep features, the state-of-the-art deep metric learning algorithms such as Triplet loss, Lifted structure loss, and Multi-similarity loss become the special cases of ANML. Furthermore, the \emph{log-exp mean function} proposed in our method gives a new perspective to review the deep metric learning methods such as Prox-NCA and N-pairs loss. At last, promising experimental results demonstrate the effectiveness of the proposed method.
    Multi-agent Covering Option Discovery based on Kronecker Product of Factor Graphs. (arXiv:2201.08227v1 [cs.MA])
    (2 min) Covering option discovery has been developed to improve the exploration of reinforcement learning in single-agent scenarios with sparse reward signals, through connecting the most distant states in the embedding space provided by the Fiedler vector of the state transition graph. However, these option discovery methods cannot be directly extended to multi-agent scenarios, since the joint state space grows exponentially with the number of agents in the system. Thus, existing researches on adopting options in multi-agent scenarios still rely on single-agent option discovery and fail to directly discover the joint options that can improve the connectivity of the joint state space of agents. In this paper, we show that it is indeed possible to directly compute multi-agent options with collaborative exploratory behaviors among the agents, while still enjoying the ease of decomposition. Our key idea is to approximate the joint state space as a Kronecker graph -- the Kronecker product of individual agents' state transition graphs, based on which we can directly estimate the Fiedler vector of the joint state space using the Laplacian spectrum of individual agents' transition graphs. This decomposition enables us to efficiently construct multi-agent joint options by encouraging agents to connect the sub-goal joint states which are corresponding to the minimum or maximum values of the estimated joint Fiedler vector. The evaluation based on multi-agent collaborative tasks shows that the proposed algorithm can successfully identify multi-agent options, and significantly outperforms prior works using single-agent options or no options, in terms of both faster exploration and higher cumulative rewards.
    Sim-to-Lab-to-Real: Safe Reinforcement Learning with Shielding and Generalization Guarantees. (arXiv:2201.08355v1 [cs.RO])
    (2 min) Safety is a critical component of autonomous systems and remains a challenge for learning-based policies to be utilized in the real world. In particular, policies learned using reinforcement learning often fail to generalize to novel environments due to unsafe behavior. In this paper, we propose Sim-to-Lab-to-Real to safely close the reality gap. To improve safety, we apply a dual policy setup where a performance policy is trained using the cumulative task reward and a backup (safety) policy is trained by solving the reach-avoid Bellman Equation based on Hamilton-Jacobi reachability analysis. In Sim-to-Lab transfer, we apply a supervisory control scheme to shield unsafe actions during exploration; in Lab-to-Real transfer, we leverage the Probably Approximately Correct (PAC)-Bayes framework to provide lower bounds on the expected performance and safety of policies in unseen environments. We empirically study the proposed framework for ego-vision navigation in two types of indoor environments including a photo-realistic one. We also demonstrate strong generalization performance through hardware experiments in real indoor spaces with a quadrupedal robot. See https://sites.google.com/princeton.edu/sim-to-lab-to-real for supplementary material.
    Time Series Forecasting via Learning Convolutionally Low-Rank Models. (arXiv:2104.11510v5 [cs.LG] UPDATED)
    (2 min) Recently, Liu and Zhang studied the rather challenging problem of time series forecasting from the perspective of compressed sensing. They proposed a no-learning method, named Convolution Nuclear Norm Minimization (CNNM), and proved that CNNM can exactly recover the future part of a series from its observed part, provided that the series is convolutionally low-rank. While impressive, the convolutional low-rankness condition may not be satisfied whenever the series is far from being seasonal, and is in fact brittle to the presence of trends and dynamics. This paper tries to approach the issues by integrating a learnable, orthonormal transformation into CNNM, with the purpose for converting the series of involute structures into regular signals of convolutionally low-rank. We prove that the resultant model, termed Learning-Based CNNM (LbCNNM), strictly succeeds in identifying the future part of a series, as long as the transform of the series is convolutionally low-rank. To learn proper transformations that may meet the required success conditions, we devise an interpretable method based on Principal Component Pursuit (PCP). Equipped with this learning method and some elaborate data argumentation skills, LbCNNM not only can handle well the major components of time series (including trends, seasonality and dynamics), but also can make use of the forecasts provided by some other forecasting methods; this means LbCNNM can be used as a general tool for model combination. Extensive experiments on 100,452 real-world time series from Time Series Data Library (TSDL) and M4 Competition (M4) demonstrate the superior performance of LbCNNM.
    Dataset Structural Index: Leveraging a machine's perspective towards visual data. (arXiv:2110.04070v2 [cs.CV] UPDATED)
    (2 min) With advances in vision and perception architectures, we have realized that working with data is equally crucial, if not more, than the algorithms. Till today, we have trained machines based on our knowledge and perspective of the world. The entire concept of Dataset Structural Index(DSI) revolves around understanding a machine`s perspective of the dataset. With DSI, I show two meta values with which we can get more information over a visual dataset and use it to optimize data, create better architectures, and have an ability to guess which model would work best. These two values are the Variety contribution ratio and Similarity matrix. In the paper, I show many applications of DSI, one of which is how the same level of accuracy can be achieved with the same model architectures trained over less amount of data.
    Symplectic Momentum Neural Networks -- Using Discrete Variational Mechanics as a prior in Deep Learning. (arXiv:2201.08281v1 [cs.LG])
    (2 min) With deep learning being gaining attention from the research community for prediction and control of real physical systems, learning important representations is becoming now more than ever mandatory. It is of extremely importance that deep learning representations are coherent with physics. When learning from discrete data this can be guaranteed by including some sort of prior into the learning, however not all discretization priors preserve important structures from the physics. In this paper we introduce Symplectic Momentum Neural Networks (SyMo) as models from a discrete formulation of mechanics for non-separable mechanical systems. The combination of such formulation leads SyMos to be constrained towards preserving important geometric structures such as momentum and a symplectic form and learn from limited data. Furthermore, it allows to learn dynamics only from the poses as training data. We extend SyMos to include variational integrators within the learning framework by developing an implicit root-find layer which leads to End-to-End Symplectic Momentum Neural Networks (E2E-SyMo). Through experimental results, using the pendulum and cartpole we show that such combination not only allows these models tol earn from limited data but also provides the models with the capability of preserving the symplectic form and show better long-term behaviour.
    Sparsifying the Update Step in Graph Neural Networks. (arXiv:2109.00909v2 [cs.LG] UPDATED)
    (2 min) Message-Passing Neural Networks (MPNNs), the most prominent Graph Neural Network (GNN) framework, celebrate much success in the analysis of graph-structured data. Concurrently, the sparsification of Neural Network models attracts a great amount of academic and industrial interest. In this paper, we conduct a structured study of the effect of sparsification on the trainable part of MPNNs known as the Update step. To this end, we design a series of models to successively sparsify the linear transform in the Update step. Specifically, we propose the ExpanderGNN model with a tuneable sparsification rate and the Activation-Only GNN, which has no linear transform in the Update step. In agreement with a growing trend in the literature, the sparsification paradigm is changed by initialising sparse neural network architectures rather than expensively sparsifying already trained architectures. Our novel benchmark models enable a better understanding of the influence of the Update step on model performance and outperform existing simplified benchmark models such as the Simple Graph Convolution. The ExpanderGNNs, and in some cases the Activation-Only models, achieve performance on par with their vanilla counterparts on several downstream tasks while containing significantly fewer trainable parameters. In experiments with matching parameter numbers, our benchmark models outperform the state-of-the-art GNN models. Our code is publicly available at: https://github.com/ChangminWu/ExpanderGNN.
    Temporal Knowledge Graph Completion: A Survey. (arXiv:2201.08236v1 [cs.AI])
    (2 min) Knowledge graph completion (KGC) can predict missing links and is crucial for real-world knowledge graphs, which widely suffer from incompleteness. KGC methods assume a knowledge graph is static, but that may lead to inaccurate prediction results because many facts in the knowledge graphs change over time. Recently, emerging methods have shown improved predictive results by further incorporating the timestamps of facts; namely, temporal knowledge graph completion (TKGC). With this temporal information, TKGC methods can learn the dynamic evolution of the knowledge graph that KGC methods fail to capture. In this paper, for the first time, we summarize the recent advances in TKGC research. First, we detail the background of TKGC, including the problem definition, benchmark datasets, and evaluation metrics. Then, we summarize existing TKGC methods based on how timestamps of facts are used to capture the temporal dynamics. Finally, we conclude the paper and present future research directions of TKGC.
    Revisiting Memory Efficient Kernel Approximation: An Indefinite Learning Perspective. (arXiv:2112.09893v2 [cs.LG] UPDATED)
    (2 min) Matrix approximations are a key element in large-scale algebraic machine learning approaches. The recently proposed method MEKA (Si et al., 2014) effectively employs two common assumptions in Hilbert spaces: the low-rank property of an inner product matrix obtained from a shift-invariant kernel function and a data compactness hypothesis by means of an inherent block-cluster structure. In this work, we extend MEKA to be applicable not only for shift-invariant kernels but also for non-stationary kernels like polynomial kernels and an extreme learning kernel. We also address in detail how to handle non-positive semi-definite kernel functions within MEKA, either caused by the approximation itself or by the intentional use of general kernel functions. We present a Lanczos-based estimation of a spectrum shift to develop a stable positive semi-definite MEKA approximation, also usable in classical convex optimization frameworks. Furthermore, we support our findings with theoretical considerations and a variety of experiments on synthetic and real-world data.
    Applicability of Large Corporate Credit Models to Small Business Risk Assessment. (arXiv:2201.08276v1 [q-fin.GN])
    (2 min) There is a massive underserved market for small business lending in the US with the Federal Reserve estimating over \$650B in unmet annual financing needs. Assessing the credit risk of a small business is key to making good decisions whether to lend and at what terms. Large corporations have a well-established credit assessment ecosystem, but small businesses suffer from limited publicly available data and few (if any) credit analysts who cover them closely. We explore the applicability of (DL-based) large corporate credit risk models to small business credit rating.
    Data Augmentation for Bayesian Deep Learning. (arXiv:1903.09668v2 [stat.ML] UPDATED)
    (2 min) Deep Learning (DL) methods have emerged as one of the most powerful tools for functional approximation and prediction. While the representation properties of DL have been well studied, uncertainty quantification remains challenging and largely unexplored. Data augmentation techniques are a natural approach to provide uncertainty quantification and to integrate stochastic MCMC search with stochastic gradient descent (SGD) methods. The purpose of our paper is to show that training DL architectures with data augmentation leads to efficiency gains. To demonstrate our methodology, we develop data augmentation algorithms for a variety of commonly used activation functions: logit, ReLU and SVM. Our methodology is compared with traditional stochastic gradient descent with back-propagation. Our optimization procedure leads to a version of iteratively re-weighted least squares and can be implemented at scale with accelerated linear algebra methods providing substantial performance improvement. We illustrate our methodology on a number of standard datasets. Finally, we conclude with directions for future research.
    Quantum Discriminator for Binary Classification. (arXiv:2009.01235v2 [quant-ph] UPDATED)
    (2 min) Quantum computers have the unique ability to operate relatively quickly in high-dimensional spaces -- this is sought to give them a competitive advantage over classical computers. In this work, we propose a novel quantum machine learning model called the Quantum Discriminator, which leverages the ability of quantum computers to operate in the high-dimensional spaces. The quantum discriminator is trained using a quantum-classical hybrid algorithm in O(N logN) time, and inferencing is performed on a universal quantum computer in linear time. The quantum discriminator takes as input the binary features extracted from a given datum along with a prediction qubit initialized to the zero state and outputs the predicted label. We analyze its performance on the Iris data set and show that the quantum discriminator can attain 99% accuracy in simulation.
    Learning with latent group sparsity via heat flow dynamics on networks. (arXiv:2201.08326v1 [stat.ME])
    (2 min) Group or cluster structure on explanatory variables in machine learning problems is a very general phenomenon, which has attracted broad interest from practitioners and theoreticians alike. In this work we contribute an approach to learning under such group structure, that does not require prior information on the group identities. Our paradigm is motivated by the Laplacian geometry of an underlying network with a related community structure, and proceeds by directly incorporating this into a penalty that is effectively computed via a heat flow-based local network dynamics. In fact, we demonstrate a procedure to construct such a network based on the available data. Notably, we dispense with computationally intensive pre-processing involving clustering of variables, spectral or otherwise. Our technique is underpinned by rigorous theorems that guarantee its effective performance and provide bounds on its sample complexity. In particular, in a wide range of settings, it provably suffices to run the heat flow dynamics for time that is only logarithmic in the problem dimensions. We explore in detail the interfaces of our approach with key statistical physics models in network science, such as the Gaussian Free Field and the Stochastic Block Model. We validate our approach by successful applications to real-world data from a wide array of application domains, including computer science, genetics, climatology and economics. Our work raises the possibility of applying similar diffusion-based techniques to classical learning tasks, exploiting the interplay between geometric, dynamical and stochastic structures underlying the data.
    Goal-Conditioned Reinforcement Learning: Problems and Solutions. (arXiv:2201.08299v1 [cs.AI])
    (2 min) Goal-conditioned reinforcement learning (GCRL), related to a set of complex RL problems, trains an agent to achieve different goals under particular scenarios. Compared to the standard RL solutions that learn a policy solely depending on the states or observations, GCRL additionally requires the agent to make decisions according to different goals. In this survey, we provide a comprehensive overview of the challenges and algorithms for GCRL. Firstly, we answer what the basic problems are studied in this field. Then, we explain how goals are represented and present how existing solutions are designed from different points of view. Finally, we make the conclusion and discuss potential future prospects that recent researches focus on.
    S-Extension Patch: A simple and efficient way to extend an object detection model. (arXiv:2110.02670v2 [cs.CV] UPDATED)
    (2 min) While building convolutional network-based systems, the toll it takes to train the network is something that cannot be ignored. In cases where we need to append additional capabilities to the existing model, the attention immediately goes towards retraining techniques. In this paper, I show how to leverage knowledge about the dataset to append the class faster while maintaining the speed of inference as well as the accuracies; while reducing the amount of time and data required. The method can extend a class in the existing object detection model in 1/10th of the time compared to the other existing methods. S-Extension patch not only offers faster training but also speed and ease of adaptation, as it can be appended to any existing system, given it fulfills the similarity threshold condition.
    ASL Video Corpora & Sign Bank: Resources Available through the American Sign Language Linguistic Research Project (ASLLRP). (arXiv:2201.07899v1 [cs.CL])
    (2 min) The American Sign Language Linguistic Research Project (ASLLRP) provides Internet access to high-quality ASL video data, generally including front and side views and a close-up of the face. The manual and non-manual components of the signing have been linguistically annotated using SignStream(R). The recently expanded video corpora can be browsed and searched through the Data Access Interface (DAI 2) we have designed; it is possible to carry out complex searches. The data from our corpora can also be downloaded; annotations are available in an XML export format. We have also developed the ASLLRP Sign Bank, which contains almost 6,000 sign entries for lexical signs, with distinct English-based glosses, with a total of 41,830 examples of lexical signs (in addition to about 300 gestures, over 1,000 fingerspelled signs, and 475 classifier examples). The Sign Bank is likewise accessible and searchable on the Internet; it can also be accessed from within SignStream(R) (software to facilitate linguistic annotation and analysis of visual language data) to make annotations more accurate and efficient. Here we describe the available resources. These data have been used for many types of research in linguistics and in computer-based sign language recognition from video; examples of such research are provided in the latter part of this article.
    Decoupling the Depth and Scope of Graph Neural Networks. (arXiv:2201.07858v1 [cs.LG])
    (2 min) State-of-the-art Graph Neural Networks (GNNs) have limited scalability with respect to the graph and model sizes. On large graphs, increasing the model depth often means exponential expansion of the scope (i.e., receptive field). Beyond just a few layers, two fundamental challenges emerge: 1. degraded expressivity due to oversmoothing, and 2. expensive computation due to neighborhood explosion. We propose a design principle to decouple the depth and scope of GNNs -- to generate representation of a target entity (i.e., a node or an edge), we first extract a localized subgraph as the bounded-size scope, and then apply a GNN of arbitrary depth on top of the subgraph. A properly extracted subgraph consists of a small number of critical neighbors, while excluding irrelevant ones. The GNN, no matter how deep it is, smooths the local neighborhood into informative representation rather than oversmoothing the global graph into "white noise". Theoretically, decoupling improves the GNN expressive power from the perspectives of graph signal processing (GCN), function approximation (GraphSAGE) and topological learning (GIN). Empirically, on seven graphs (with up to 110M nodes) and six backbone GNN architectures, our design achieves significant accuracy improvement with orders of magnitude reduction in computation and hardware cost.
    Cross-Domain Few-Shot Graph Classification. (arXiv:2201.08265v1 [cs.LG])
    (2 min) We study the problem of few-shot graph classification across domains with nonequivalent feature spaces by introducing three new cross-domain benchmarks constructed from publicly available datasets. We also propose an attention-based graph encoder that uses three congruent views of graphs, one contextual and two topological views, to learn representations of task-specific information for fast adaptation, and task-agnostic information for knowledge transfer. We run exhaustive experiments to evaluate the performance of contrastive and meta-learning strategies. We show that when coupled with metric-based meta-learning frameworks, the proposed encoder achieves the best average meta-test classification accuracy across all benchmarks. The source code and data will be released here: https://github.com/kavehhassani/metagrl
    Exploiting Meta-Cognitive Features for a Machine-Learning-Based One-Shot Group-Decision Aggregation. (arXiv:2201.08247v1 [cs.LG])
    (2 min) The outcome of a collective decision-making process, such as crowdsourcing, often relies on the procedure through which the perspectives of its individual members are aggregated. Popular aggregation methods, such as the majority rule, often fail to produce the optimal result, especially in high-complexity tasks. Methods that rely on meta-cognitive information, such as confidence-based methods and the Surprisingly Popular Option, had shown an improvement in various tasks. However, there is still a significant number of cases with no optimal solution. Our aim is to exploit meta-cognitive information and to learn from it, for the purpose of enhancing the ability of the group to produce a correct answer. Specifically, we propose two different feature-representation approaches: (1) Response-Centered feature Representation (RCR), which focuses on the characteristics of the individual response instances, and (2) Answer-Centered feature Representation (ACR), which focuses on the characteristics of each of the potential answers. Using these two feature-representation approaches, we train Machine-Learning (ML) models, for the purpose of predicting the correctness of a response and of an answer. The trained models are used as the basis of an ML-based aggregation methodology that, contrary to other ML-based techniques, has the advantage of being a "one-shot" technique, independent from the crowd-specific composition and personal record, and adaptive to various types of situations. To evaluate our methodology, we collected 2490 responses for different tasks, which we used for feature engineering and for the training of ML models. We tested our feature-representation approaches through the performance of our proposed ML-based aggregation methods. The results show an increase of 20% to 35% in the success rate, compared to the use of standard rule-based aggregation methods.
    Selecting the suitable resampling strategy for imbalanced data classification regarding dataset properties. (arXiv:2201.07932v1 [cs.LG])
    (2 min) In many application domains such as medicine, information retrieval, cybersecurity, social media, etc., datasets used for inducing classification models often have an unequal distribution of the instances of each class. This situation, known as imbalanced data classification, causes low predictive performance for the minority class examples. Thus, the prediction model is unreliable although the overall model accuracy can be acceptable. Oversampling and undersampling techniques are well-known strategies to deal with this problem by balancing the number of examples of each class. However, their effectiveness depends on several factors mainly related to data intrinsic characteristics, such as imbalance ratio, dataset size and dimensionality, overlapping between classes or borderline examples. In this work, the impact of these factors is analyzed through a comprehensive comparative study involving 40 datasets from different application areas. The objective is to obtain models for automatic selection of the best resampling strategy for any dataset based on its characteristics. These models allow us to check several factors simultaneously considering a wide range of values since they are induced from very varied datasets that cover a broad spectrum of conditions. This differs from most studies that focus on the individual analysis of the characteristics or cover a small range of values. In addition, the study encompasses both basic and advanced resampling strategies that are evaluated by means of eight different performance metrics, including new measures specifically designed for imbalanced data classification. The general nature of the proposal allows the choice of the most appropriate method regardless of the domain, avoiding the search for special purpose techniques that could be valid for the target data.
    Informative Pseudo-Labeling for Graph Neural Networks with Few Labels. (arXiv:2201.07951v1 [cs.LG])
    (2 min) Graph Neural Networks (GNNs) have achieved state-of-the-art results for semi-supervised node classification on graphs. Nevertheless, the challenge of how to effectively learn GNNs with very few labels is still under-explored. As one of the prevalent semi-supervised methods, pseudo-labeling has been proposed to explicitly address the label scarcity problem. It aims to augment the training set with pseudo-labeled unlabeled nodes with high confidence so as to re-train a supervised model in a self-training cycle. However, the existing pseudo-labeling approaches often suffer from two major drawbacks. First, they tend to conservatively expand the label set by selecting only high-confidence unlabeled nodes without assessing their informativeness. Unfortunately, those high-confidence nodes often convey overlapping information with given labels, leading to minor improvements for model re-training. Second, these methods incorporate pseudo-labels to the same loss function with genuine labels, ignoring their distinct contributions to the classification task. In this paper, we propose a novel informative pseudo-labeling framework, called InfoGNN, to facilitate learning of GNNs with extremely few labels. Our key idea is to pseudo label the most informative nodes that can maximally represent the local neighborhoods via mutual information maximization. To mitigate the potential label noise and class-imbalance problem arising from pseudo labeling, we also carefully devise a generalized cross entropy loss with a class-balanced regularization to incorporate generated pseudo labels into model re-training. Extensive experiments on six real-world graph datasets demonstrate that our proposed approach significantly outperforms state-of-the-art baselines and strong self-supervised methods on graphs.
    NNP/MM: Fast molecular dynamics simulations with machine learning potentials and molecular mechanics. (arXiv:2201.08110v1 [q-bio.BM])
    (2 min) Parametric and non-parametric machine learning potentials have emerged recently as a way to improve the accuracy of bio-molecular simulations. Here, we present NNP/MM, an hybrid method integrating neural network potentials (NNPs) and molecular mechanics (MM). It allows to simulate a part of molecular system with NNP, while the rest is simulated with MM for efficiency. The method is currently available in ACEMD using OpenMM plugins to optimize the performance of NNPs. The achieved performance is slower but comparable to the state-of-the-art GPU-accelerated MM simulations. We validated NNP/MM by performing MD simulations of four protein-ligand complexes, where NNP is used for the intra-molecular interactions of a lignad and MM for the rest interactions. This shows that NNP can already replace MM for small molecules in protein-ligand simulations. The combined sampling of each complex is 1 microsecond, which are the longest simulations of NNP/MM ever reported. Finally, we have made the setup of the NNP/MM simulations simple and user-friendly.
    Network-based link prediction of scientific concepts -- a Science4Cast competition entry. (arXiv:2201.07978v1 [cs.SI])
    (2 min) We report on a model built to predict links in a complex network of scientific concepts, in the context of the Science4Cast 2021 competition. We show that the network heavily favours linking nodes of high degree, indicating that new scientific connections are primarily made between popular concepts, which constitutes the main feature of our model. Besides this notion of popularity, we use a measure of similarity between nodes quantified by a normalized count of their common neighbours to improve the model. Finally, we show that the model can be further improved by considering a time-weighted adjacency matrix with both older and newer links having higher impact in the predictions, representing rooted concepts and state of the art research, respectively.
    Low-Pass Filtering SGD for Recovering Flat Optima in the Deep Learning Optimization Landscape. (arXiv:2201.08025v1 [cs.LG])
    (2 min) In this paper, we study the sharpness of a deep learning (DL) loss landscape around local minima in order to reveal systematic mechanisms underlying the generalization abilities of DL models. Our analysis is performed across varying network and optimizer hyper-parameters, and involves a rich family of different sharpness measures. We compare these measures and show that the low-pass filter-based measure exhibits the highest correlation with the generalization abilities of DL models, has high robustness to both data and label noise, and furthermore can track the double descent behavior for neural networks. We next derive the optimization algorithm, relying on the low-pass filter (LPF), that actively searches the flat regions in the DL optimization landscape using SGD-like procedure. The update of the proposed algorithm, that we call LPF-SGD, is determined by the gradient of the convolution of the filter kernel with the loss function and can be efficiently computed using MC sampling. We empirically show that our algorithm achieves superior generalization performance compared to the common DL training strategies. On the theoretical front, we prove that LPF-SGD converges to a better optimal point with smaller generalization error than SGD.
    Corrigendum and addendum to: How Populist are Parties? Measuring Degrees of Populism in Party Manifestos Using Supervised Machine Learning. (arXiv:2201.07972v1 [physics.soc-ph])
    (2 min) This paper is a corrigendum and addendum to the previously published article: 'How Populist are Parties? Measuring Degrees of Populism in Party Manifestos Using Supervised Machine Learning' (Political Analysis, 1-17. doi:10.1017/pan.2021.29). These corrigendum and addendum were prepared to correct errors in data labelling and show some extra insights not included in the previously published paper. Here, we report these corrections and point to some additional conclusions by focusing on the effects of the label reshuffling per parties and years and presenting new figures wherever appropriate. We show that although the simplified labelling method proposed in the previously-published article can induce biases in the correlations with expert scores, random labelling reduces correlations significantly. We show that this is also true for correlations based on a manually-coded data set. These modifications are based on other evidence and results reported in detail in a future publication.
    On Aliased Resizing and Surprising Subtleties in GAN Evaluation. (arXiv:2104.11222v2 [cs.CV] UPDATED)
    (2 min) Metrics for evaluating generative models aim to measure the discrepancy between real and generated images. The often-used Frechet Inception Distance (FID) metric, for example, extracts "high-level" features using a deep network from the two sets. However, we find that the differences in "low-level" preprocessing, specifically image resizing and compression, can induce large variations and have unforeseen consequences. For instance, when resizing an image, e.g., with a bilinear or bicubic kernel, signal processing principles mandate adjusting prefilter width depending on the downsampling factor, to antialias to the appropriate bandwidth. However, commonly-used implementations use a fixed-width prefilter, resulting in aliasing artifacts. Such aliasing leads to corruptions in the feature extraction downstream. Next, lossy compression, such as JPEG, is commonly used to reduce the file size of an image. Although designed to minimally degrade the perceptual quality of an image, the operation also produces variations downstream. Furthermore, we show that if compression is used on real training images, FID can actually improve if the generated images are also subsequently compressed. This paper shows that choices in low-level image processing have been an underappreciated aspect of generative modeling. We identify and characterize variations in generative modeling development pipelines, provide recommendations based on signal processing principles, and release a reference implementation to facilitate future comparisons.
    Long Short-Term Memory Neural Network for Financial Time Series. (arXiv:2201.08218v1 [q-fin.ST])
    (2 min) Performance forecasting is an age-old problem in economics and finance. Recently, developments in machine learning and neural networks have given rise to non-linear time series models that provide modern and promising alternatives to traditional methods of analysis. In this paper, we present an ensemble of independent and parallel long short-term memory (LSTM) neural networks for the prediction of stock price movement. LSTMs have been shown to be especially suited for time series data due to their ability to incorporate past information, while neural network ensembles have been found to reduce variability in results and improve generalization. A binary classification problem based on the median of returns is used, and the ensemble's forecast depends on a threshold value, which is the minimum number of LSTMs required to agree upon the result. The model is applied to the constituents of the smaller, less efficient Stockholm OMX30 instead of other major market indices such as the DJIA and S&P500 commonly found in literature. With a straightforward trading strategy, comparisons with a randomly chosen portfolio and a portfolio containing all the stocks in the index show that the portfolio resulting from the LSTM ensemble provides better average daily returns and higher cumulative returns over time. Moreover, the LSTM portfolio also exhibits less volatility, leading to higher risk-return ratios.
    Homogenization of Existing Inertial-Based Datasets to Support Human Activity Recognition. (arXiv:2201.07891v1 [eess.SP])
    (2 min) Several techniques have been proposed to address the problem of recognizing activities of daily living from signals. Deep learning techniques applied to inertial signals have proven to be effective, achieving significant classification accuracy. Recently, research in human activity recognition (HAR) models has been almost totally model-centric. It has been proven that the number of training samples and their quality are critical for obtaining deep learning models that both perform well independently of their architecture, and that are more robust to intraclass variability and interclass similarity. Unfortunately, publicly available datasets do not always contain hight quality data and a sufficiently large and diverse number of samples (e.g., number of subjects, type of activity performed, and duration of trials). Furthermore, datasets are heterogeneous among them and therefore cannot be trivially combined to obtain a larger set. The final aim of our work is the definition and implementation of a platform that integrates datasets of inertial signals in order to make available to the scientific community large datasets of homogeneous signals, enriched, when possible, with context information (e.g., characteristics of the subjects and device position). The main focus of our platform is to emphasise data quality, which is essential for training efficient models.
    Learning-From-Disagreement: A Model Comparison and Visual Analytics Framework. (arXiv:2201.07849v1 [cs.LG])
    (2 min) With the fast-growing number of classification models being produced every day, numerous model interpretation and comparison solutions have also been introduced. For example, LIME and SHAP can interpret what input features contribute more to a classifier's output predictions. Different numerical metrics (e.g., accuracy) can be used to easily compare two classifiers. However, few works can interpret the contribution of a data feature to a classifier in comparison with its contribution to another classifier. This comparative interpretation can help to disclose the fundamental difference between two classifiers, select classifiers in different feature conditions, and better ensemble two classifiers. To accomplish it, we propose a learning-from-disagreement (LFD) framework to visually compare two classification models. Specifically, LFD identifies data instances with disagreed predictions from two compared classifiers and trains a discriminator to learn from the disagreed instances. As the two classifiers' training features may not be available, we train the discriminator through a set of meta-features proposed based on certain hypotheses of the classifiers to probe their behaviors. Interpreting the trained discriminator with the SHAP values of different meta-features, we provide actionable insights into the compared classifiers. Also, we introduce multiple metrics to profile the importance of meta-features from different perspectives. With these metrics, one can easily identify meta-features with the most complementary behaviors in two classifiers, and use them to better ensemble the classifiers. We focus on binary classification models in the financial services and advertising industry to demonstrate the efficacy of our proposed framework and visualizations.
    Demand-Driven Asset Reutilization Analytics. (arXiv:2201.07921v1 [cs.LG])
    (2 min) Manufacturers have long benefited from reusing returned products and parts. This benevolent approach can minimize cost and help the manufacturer to play a role in sustaining the environment, something which is of utmost importance these days because of growing environment concerns. Reuse of returned parts and products aids environment sustainability because doing so helps reduce the use of raw materials, eliminate energy use to produce new parts, and minimize waste materials. However, handling returns effectively and efficiently can be difficult if the processes do not provide the visibility that is necessary to track, manage, and re-use the returns. This paper applies advanced analytics on procurement data to increase reutilization in new build by optimizing Equal-to-New (ETN) parts return. This will reduce 'the spend' on new buy parts for building new product units. The process involves forecasting and matching returns supply to demand for new build. Complexity in the process is the forecasting and matching while making sure a reutilization engineering process is available. Also, this will identify high demand/value/yield parts for development engineering to focus. Analytics has been applied on different levels to enhance the optimization process including forecast of upgraded parts. Machine Learning algorithms are used to build an automated infrastructure that can support the transformation of ETN parts utilization in the procurement parts planning process. This system incorporate returns forecast in the planning cycle to reduce suppliers liability from 9 weeks to 12 months planning cycle, e.g., reduce 5% of 10 million US dollars liability.
    PROMPT: Learning Dynamic Resource Allocation Policies for Edge-Network Applications. (arXiv:2201.07916v1 [cs.LG])
    (2 min) A growing number of service providers are exploring methods to improve server utilization, reduce power consumption, and reduce total cost of ownership by co-scheduling high-priority latency-critical workloads with best-effort workloads. This practice requires strict resource allocation between workloads to reduce resource contention and maintain Quality of Service (QoS) guarantees. Prior resource allocation works have been shown to improve server utilization under ideal circumstances, yet often compromise QoS guarantees or fail to find valid resource allocations in more dynamic operating environments. Further, prior works are fundamentally reliant upon QoS measurements that can, in practice, exhibit significant transient fluctuations, thus stable control behavior cannot be reliably achieved. In this paper, we propose a novel framework for dynamic resource allocation based on proactive QoS prediction. These predictions help guide a reinforcement-learning-based resource controller towards optimal resource allocations while avoiding transient QoS violations due to fluctuating workload demands. Evaluation shows that the proposed method incurs 4.3x fewer QoS violations, reduces severity of QoS violations by 3.7x, improves best-effort workload performance, and improves overall power efficiency compared with prior work.
    PDE-Based Optimal Strategy for Unconstrained Online Learning. (arXiv:2201.07877v1 [cs.LG])
    (2 min) Unconstrained Online Linear Optimization (OLO) is a practical problem setting to study the training of machine learning models. Existing works proposed a number of potential-based algorithms, but in general the design of such potential functions is ad hoc and heavily relies on guessing. In this paper, we present a framework that generates time-varying potential functions by solving a Partial Differential Equation (PDE). Our framework recovers some classical potentials, and more importantly provides a systematic approach to design new ones. The power of our framework is demonstrated through a concrete example. When losses are 1-Lipschitz, we design a novel OLO algorithm with anytime regret upper bound $C\sqrt{T}+||u||\sqrt{2T}[\sqrt{\log(1+||u||/C)}+2]$, where $C$ is a user-specified constant and $u$ is any comparator whose norm is unknown and unbounded a priori. By constructing a matching lower bound, we further show that the leading order term, including the constant multiplier $\sqrt{2}$, is tight. To our knowledge, this is the first parameter-free algorithm with optimal leading constant. The obtained theoretical benefits are validated by experiments.
    Convolutional Neural Networks for Spherical Signal Processing via Spherical Haar Tight Framelets. (arXiv:2201.07890v1 [eess.SP])
    (2 min) In this paper, we develop a general theoretical framework for constructing Haar-type tight framelets on any compact set with a hierarchical partition. In particular, we construct a novel area-regular hierarchical partition on the 2-sphere and establish its corresponding spherical Haar tight framelets with directionality. We conclude by evaluating and illustrating the effectiveness of our area-regular spherical Haar tight framelets in several denoising experiments. Furthermore, we propose a convolutional neural network (CNN) model for spherical signal denoising which employs the fast framelet decomposition and reconstruction algorithms. Experiment results show that our proposed CNN model outperforms threshold methods, and processes strong generalization and robustness properties.
    Machine Learning Enhances Algorithms for Quantifying Non-Equilibrium Dynamics in Correlation Spectroscopy Experiments to Reach Frame-Rate-Limited Time Resolution. (arXiv:2201.07889v1 [eess.SP])
    (2 min) Analysis of X-ray Photon Correlation Spectroscopy (XPCS) data for non-equilibrium dynamics often requires manual binning of age regions of an intensity-intensity correlation function. This leads to a loss of temporal resolution and accumulation of systematic error for the parameters quantifying the dynamics, especially in cases with considerable noise. Moreover, the experiments with high data collection rates create the need for automated online analysis, where manual binning is not possible. Here, we integrate a denoising autoencoder model into algorithms for analysis of non-equilibrium two-time intensity-intensity correlation functions. The model can be applied to an input of an arbitrary size. Noise reduction allows to extract the parameters that characterize the sample dynamics with temporal resolution limited only by frame rates. Not only does it improve the quantitative usage of the data, but it also creates the potential for automating the analytical workflow. Various approaches for uncertainty quantification and extension of the model for anomalies detection are discussed.
    GAN-based Matrix Factorization for Recommender Systems. (arXiv:2201.08042v1 [cs.IR])
    (2 min) Proposed in 2014, Generative Adversarial Networks (GAN) initiated a fresh interest in generative modelling. They immediately achieved state-of-the-art in image synthesis, image-to-image translation, text-to-image generation, image inpainting and have been used in sciences ranging from medicine to high-energy particle physics. Despite their popularity and ability to learn arbitrary distributions, GAN have not been widely applied in recommender systems (RS). Moreover, only few of the techniques that have introduced GAN in RS have employed them directly as a collaborative filtering (CF) model. In this work we propose a new GAN-based approach that learns user and item latent factors in a matrix factorization setting for the generic top-N recommendation problem. Following the vector-wise GAN training approach for RS introduced by CFGAN, we identify 2 unique issues when utilizing GAN for CF. We propose solutions for both of them by using an autoencoder as discriminator and incorporating an additional loss function for the generator. We evaluate our model, GANMF, through well-known datasets in the RS community and show improvements over traditional CF approaches and GAN-based models. Through an ablation study on the components of GANMF we aim to understand the effects of our architectural choices. Finally, we provide a qualitative evaluation of the matrix factorization performance of GANMF.
    Two-Sample Testing in Reinforcement Learning. (arXiv:2201.08078v1 [cs.LG])
    (2 min) Value-based reinforcement-learning algorithms have shown strong performances in games, robotics, and other real-world applications. The most popular sample-based method is $Q$-Learning. A $Q$-value is the expected return for a state-action pair when following a particular policy, and the algorithm subsequently performs updates by adjusting the current $Q$-value towards the observed reward and the maximum of the $Q$-values of the next state. The procedure introduces maximization bias, and solutions like Double $Q$-Learning have been considered. We frame the bias problem statistically and consider it an instance of estimating the maximum expected value (MEV) of a set of random variables. We propose the $T$-Estimator (TE) based on two-sample testing for the mean. The TE flexibly interpolates between over- and underestimation by adjusting the level of significance of the underlying hypothesis tests. A generalization termed $K$-Estimator (KE) obeys the same bias and variance bounds as the TE while relying on a nearly arbitrary kernel function. Using the TE and the KE, we introduce modifications of $Q$-Learning and its neural network analog, the Deep $Q$-Network. The proposed estimators and algorithms are thoroughly tested and validated on a diverse set of tasks and environments, illustrating the performance potential of the TE and KE.
    Correlated-informed neural networks: a new machine learning framework to predict pressure drop in micro-channels. (arXiv:2201.07835v1 [cs.LG])
    (2 min) Accurate pressure drop estimation in forced boiling phenomena is important during the thermal analysis and the geometric design of cryogenic heat exchangers. However, current methods to predict the pressure drop have one of two problems: lack of accuracy or generalization to different situations. In this work, we present the correlated-informed neural networks (CoINN), a new paradigm in applying the artificial neural network (ANN) technique combined with a successful pressure drop correlation as a mapping tool to predict the pressure drop of zeotropic mixtures in micro-channels. The proposed approach is inspired by Transfer Learning, highly used in deep learning problems with reduced datasets. Our method improves the ANN performance by transferring the knowledge of the Sun & Mishima correlation for the pressure drop to the ANN. The correlation having physical and phenomenological implications for the pressure drop in micro-channels considerably improves the performance and generalization capabilities of the ANN. The final architecture consists of three inputs: the mixture vapor quality, the micro-channel inner diameter, and the available pressure drop correlation. The results show the benefits gained using the correlated-informed approach predicting experimental data used for training and a posterior test with a mean relative error (mre) of 6%, lower than the Sun & Mishima correlation of 13%. Additionally, this approach can be extended to other mixtures and experimental settings, a missing feature in other approaches for mapping correlations using ANNs for heat transfer applications.
    Before and After: Machine learning for perioperative patient care. (arXiv:2201.08095v1 [cs.CY])
    (2 min) For centuries nursing has been known as a job that requires complex manual operations, that cannot be automated or replaced by any machinery. All the devices and techniques have been invented only to support, but never fully replace, a person with knowledge and expert intuition. With the rise of Artificial Intelligence and continuously increasing digital data flow in healthcare, new tools have arrived to improve patient care and reduce the labour-intensive work conditions of a nurse. This cross-disciplinary review aims to build a bridge over the gap between computer science and nursing. It outlines and classifies the methods for machine learning and data processing in patient care before and after the operation. It comprises of Process-, Patient-, Operator-, Feedback-, and Technology-centric classifications. The presented classifications are based on the technical aspects of patient case.
    TerViT: An Efficient Ternary Vision Transformer. (arXiv:2201.08050v1 [cs.CV])
    (2 min) Vision transformers (ViTs) have demonstrated great potential in various visual tasks, but suffer from expensive computational and memory cost problems when deployed on resource-constrained devices. In this paper, we introduce a ternary vision transformer (TerViT) to ternarize the weights in ViTs, which are challenged by the large loss surface gap between real-valued and ternary parameters. To address the issue, we introduce a progressive training scheme by first training 8-bit transformers and then TerViT, and achieve a better optimization than conventional methods. Furthermore, we introduce channel-wise ternarization, by partitioning each matrix to different channels, each of which is with an unique distribution and ternarization interval. We apply our methods to popular DeiT and Swin backbones, and extensive results show that we can achieve competitive performance. For example, TerViT can quantize Swin-S to 13.1MB model size while achieving above 79% Top-1 accuracy on ImageNet dataset.
    Caring Without Sharing: A Federated Learning Crowdsensing Framework for Diversifying Representation of Cities. (arXiv:2201.07980v1 [cs.CY])
    (2 min) Mobile Crowdsensing has become main stream paradigm for researchers to collect behavioral data from citizens in large scales. This valuable data can be leveraged to create centralized repositories that can be used to train advanced Artificial Intelligent (AI) models for various services that benefit society in all aspects. Although decades of research has explored the viability of Mobile Crowdsensing in terms of incentives and many attempts have been made to reduce the participation barriers, the overshadowing privacy concerns regarding sharing personal data still remain. Recently a new pathway has emerged to enable to shift MCS paradigm towards a more privacy-preserving collaborative learning, namely Federated Learning. In this paper, we posit a first of its kind framework for this emerging paradigm. We demonstrate the functionalities of our framework through a case study of diversifying two vision algorithms through to learn the representation of ordinary sidewalk obstacles as part of enhancing visually impaired navigation.
    Hybrid Reinforcement Learning-Based Eco-Driving Strategy for Connected and Automated Vehicles at Signalized Intersections. (arXiv:2201.07833v1 [eess.SY])
    (2 min) Taking advantage of both vehicle-to-everything (V2X) communication and automated driving technology, connected and automated vehicles are quickly becoming one of the transformative solutions to many transportation problems. However, in a mixed traffic environment at signalized intersections, it is still a challenging task to improve overall throughput and energy efficiency considering the complexity and uncertainty in the traffic system. In this study, we proposed a hybrid reinforcement learning (HRL) framework which combines the rule-based strategy and the deep reinforcement learning (deep RL) to support connected eco-driving at signalized intersections in mixed traffic. Vision-perceptive methods are integrated with vehicle-to-infrastructure (V2I) communications to achieve higher mobility and energy efficiency in mixed connected traffic. The HRL framework has three components: a rule-based driving manager that operates the collaboration between the rule-based policies and the RL policy; a multi-stream neural network that extracts the hidden features of vision and V2I information; and a deep RL-based policy network that generate both longitudinal and lateral eco-driving actions. In order to evaluate our approach, we developed a Unity-based simulator and designed a mixed-traffic intersection scenario. Moreover, several baselines were implemented to compare with our new design, and numerical experiments were conducted to test the performance of the HRL model. The experiments show that our HRL method can reduce energy consumption by 12.70% and save 11.75% travel time when compared with a state-of-the-art model-based Eco-Driving approach.
    Towards deep observation: A systematic survey on artificial intelligence techniques to monitor fetus via Ultrasound Images. (arXiv:2201.07935v1 [cs.LG])
    (2 min) Developing innovative informatics approaches aimed to enhance fetal monitoring is a burgeoning field of study in reproductive medicine. Several reviews have been conducted regarding Artificial intelligence (AI) techniques to improve pregnancy outcomes. They are limited by focusing on specific data such as mother's care during pregnancy. This systematic survey aims to explore how artificial intelligence (AI) can assist with fetal growth monitoring via Ultrasound (US) image. We used eight medical and computer science bibliographic databases, including PubMed, Embase, PsycINFO, ScienceDirect, IEEE explore, ACM Library, Google Scholar, and the Web of Science. We retrieved studies published between 2010 to 2021. Data extracted from studies were synthesized using a narrative approach. Out of 1269 retrieved studies, we included 107 distinct studies from queries that were relevant to the topic in the survey. We found that 2D ultrasound images were more popular (n=88) than 3D and 4D ultrasound images (n=19). Classification is the most used method (n=42), followed by segmentation (n=31), classification integrated with segmentation (n=16) and other miscellaneous such as object-detection, regression and reinforcement learning (n=18). The most common areas within the pregnancy domain were the fetus head (n=43), then fetus body (n=31), fetus heart (n=13), fetus abdomen (n=10), and lastly the fetus face (n=10). In the most recent studies, deep learning techniques were primarily used (n=81), followed by machine learning (n=16), artificial neural network (n=7), and reinforcement learning (n=2). AI techniques played a crucial role in predicting fetal diseases and identifying fetus anatomy structures during pregnancy. More research is required to validate this technology from a physician's perspective, such as pilot studies and randomized controlled trials on AI and its applications in a hospital setting.
    Cognitive Explainers of Graph Neural Networks Based on Medical Concepts. (arXiv:2201.07798v1 [cs.LG])
    (2 min) Although deep neural networks (DNN) have achieved state-of-the-art performance in various fields, some unexpected errors are often found in the neural network, which is very dangerous for some tasks requiring high reliability and high security.The non-transparency and unexplainably of CNN still limit its application in many fields, such as medical care and finance. Despite current studies that have been committed to visualizing the decision process of DNN, most of these methods focus on the low level and do not take into account the prior knowledge of medicine.In this work, we propose an interpretable framework based on key medical concepts, enabling CNN to explain from the perspective of doctors' cognition.We propose an interpretable automatic recognition framework for the ultrasonic standard plane, which uses a concept-based graph convolutional neural network to construct the relationships between key medical concepts, to obtain an interpretation consistent with a doctor's cognition.
    DRTCI: Learning Disentangled Representations for Temporal Causal Inference. (arXiv:2201.08137v1 [cs.LG])
    (2 min) Medical professionals evaluating alternative treatment plans for a patient often encounter time varying confounders, or covariates that affect both the future treatment assignment and the patient outcome. The recently proposed Counterfactual Recurrent Network (CRN) accounts for time varying confounders by using adversarial training to balance recurrent historical representations of patient data. However, this work assumes that all time varying covariates are confounding and thus attempts to balance the full state representation. Given that the actual subset of covariates that may in fact be confounding is in general unknown, recent work on counterfactual evaluation in the static, non-temporal setting has suggested that disentangling the covariate representation into separate factors, where each either influence treatment selection, patient outcome or both can help isolate selection bias and restrict balancing efforts to factors that influence outcome, allowing the remaining factors which predict treatment without needlessly being balanced.
    One-Step Abductive Multi-Target Learning with Diverse Noisy Label Samples. (arXiv:2201.07933v1 [cs.LG])
    (2 min) One-step abductive multi-target learning (OSAMTL) was proposed to handle complex noisy labels. In this paper, giving definition of diverse noisy label samples (DNLS), we propose one-step abductive multi-target learning with DNLS (OSAMTL-DNLS) to expand the methodology of original OSAMTL to better handle complex noisy labels.
    Investigating underdiagnosis of AI algorithms in the presence of multiple sources of dataset bias. (arXiv:2201.07856v1 [cs.AI])
    (2 min) Deep learning models have shown great potential for image-based diagnosis assisting clinical decision making. At the same time, an increasing number of reports raise concerns about the potential risk that machine learning could amplify existing health disparities due to human biases that are embedded in the training data. It is of great importance to carefully investigate the extent to which biases may be reproduced or even amplified if we wish to build fair artificial intelligence systems. Seyyed-Kalantari et al. advance this conversation by analysing the performance of a disease classifier across population subgroups. They raise performance disparities related to underdiagnosis as a point of concern; we identify areas from this analysis which we believe deserve additional attention. Specifically, we wish to highlight some theoretical and practical difficulties associated with assessing model fairness through testing on data drawn from the same biased distribution as the training data, especially when the sources and amount of biases are unknown.
    Complexity from Adaptive-Symmetries Breaking: Global Minima in the Statistical Mechanics of Deep Neural Networks. (arXiv:2201.07934v1 [cs.LG])
    (2 min) An antithetical concept, adaptive symmetry, to conservative symmetry in physics is proposed to understand the deep neural networks (DNNs). It characterizes the invariance of variance, where a biotic system explores different pathways of evolution with equal probability in absence of feedback signals, and complex functional structure emerges from quantitative accumulation of adaptive-symmetries breaking in response to feedback signals. Theoretically and experimentally, we characterize the optimization process of a DNN system as an extended adaptive-symmetry-breaking process. One particular finding is that a hierarchically large DNN would have a large reservoir of adaptive symmetries, and when the information capacity of the reservoir exceeds the complexity of the dataset, the system could absorb all perturbations of the examples and self-organize into a functional structure of zero training errors measured by a certain surrogate risk. More specifically, this process is characterized by a statistical-mechanical model that could be appreciated as a generalization of statistics physics to the DNN organized complex system, and characterizes regularities in higher dimensionality. The model consists of three constitutes that could be appreciated as the counterparts of Boltzmann distribution, Ising model, and conservative symmetry, respectively: (1) a stochastic definition/interpretation of DNNs that is a multilayer probabilistic graphical model, (2) a formalism of circuits that perform biological computation, (3) a circuit symmetry from which self-similarity between the microscopic and the macroscopic adaptability manifests. The model is analyzed with a method referred as the statistical assembly method that analyzes the coarse-grained behaviors (over a symmetry group) of the heterogeneous hierarchical many-body interaction in DNNs.
    Communication-Efficient Device Scheduling for Federated Learning Using Stochastic Optimization. (arXiv:2201.07912v1 [cs.LG])
    (2 min) Federated learning (FL) is a useful tool in distributed machine learning that utilizes users' local datasets in a privacy-preserving manner. When deploying FL in a constrained wireless environment; however, training models in a time-efficient manner can be a challenging task due to intermittent connectivity of devices, heterogeneous connection quality, and non-i.i.d. data. In this paper, we provide a novel convergence analysis of non-convex loss functions using FL on both i.i.d. and non-i.i.d. datasets with arbitrary device selection probabilities for each round. Then, using the derived convergence bound, we use stochastic optimization to develop a new client selection and power allocation algorithm that minimizes a function of the convergence bound and the average communication time under a transmit power constraint. We find an analytical solution to the minimization problem. One key feature of the algorithm is that knowledge of the channel statistics is not required and only the instantaneous channel state information needs to be known. Using the FEMNIST and CIFAR-10 datasets, we show through simulations that the communication time can be significantly decreased using our algorithm, compared to uniformly random participation.
    Building a Performance Model for Deep Learning Recommendation Model Training on GPUs. (arXiv:2201.07821v1 [cs.LG])
    (2 min) We devise a performance model for GPU training of Deep Learning Recommendation Models (DLRM), whose GPU utilization is low compared to other well-optimized CV and NLP models. We show that both the device active time (the sum of kernel runtimes) and the device idle time are important components of the overall device time. We therefore tackle them separately by (1) flexibly adopting heuristic-based and ML-based kernel performance models for operators that dominate the device active time, and (2) categorizing operator overheads into five types to determine quantitatively their contribution to the device active time. Combining these two parts, we propose a critical-path-based algorithm to predict the per-batch training time of DLRM by traversing its execution graph. We achieve less than 10% geometric mean average error (GMAE) in all kernel performance modeling, and 5.23% and 7.96% geomean errors for GPU active time and overall end-to-end per-batch training time prediction, respectively. We show that our general performance model not only achieves low prediction error on DLRM, which has highly customized configurations and is dominated by multiple factors, but also yields comparable accuracy on other compute-bound ML models targeted by most previous methods. Using this performance model and graph-level data and task dependency analyses, we show our system can provide more general model-system co-design than previous methods.
    Adaptive Energy Management for Self-Sustainable Wearables in Mobile Health. (arXiv:2201.07888v1 [eess.SP])
    (2 min) Wearable devices that integrate multiple sensors, processors, and communication technologies have the potential to transform mobile health for remote monitoring of health parameters. However, the small form factor of the wearable devices limits the battery size and operating lifetime. As a result, the devices require frequent recharging, which has limited their widespread adoption. Energy harvesting has emerged as an effective method towards sustainable operation of wearable devices. Unfortunately, energy harvesting alone is not sufficient to fulfill the energy requirements of wearable devices. This paper studies the novel problem of adaptive energy management towards the goal of self-sustainable wearables by using harvested energy to supplement the battery energy and to reduce manual recharging by users. To solve this problem, we propose a principled algorithm referred as AdaEM. There are two key ideas behind AdaEM. First, it uses machine learning (ML) methods to learn predictive models of user activity and energy usage patterns. These models allow us to estimate the potential of energy harvesting in a day as a function of the user activities. Second, it reasons about the uncertainty in predictions and estimations from the ML models to optimize the energy management decisions using a dynamic robust optimization (DyRO) formulation. We propose a light-weight solution for DyRO to meet the practical needs of deployment. We validate the AdaEM approach on a wearable device prototype consisting of solar and motion energy harvesting using real-world data of user activities. Experiments show that AdaEM achieves solutions that are within 5% of the optimal with less than 0.005% execution time and energy overhead.
    AstBERT: Enabling Language Model for Code Understanding with Abstract Syntax Tree. (arXiv:2201.07984v1 [cs.AI])
    (2 min) Using a pre-trained language model (i.e. BERT) to apprehend source codes has attracted increasing attention in the natural language processing community. However, there are several challenges when it comes to applying these language models to solve programming language (PL) related problems directly, the significant one of which is the lack of domain knowledge issue that substantially deteriorates the model's performance. To this end, we propose the AstBERT model, a pre-trained language model aiming to better understand the PL using the abstract syntax tree (AST). Specifically, we collect a colossal amount of source codes (both java and python) from GitHub and incorporate the contextual code knowledge into our model through the help of code parsers, in which AST information of the source codes can be interpreted and integrated. We verify the performance of the proposed model on code information extraction and code search tasks, respectively. Experiment results show that our AstBERT model achieves state-of-the-art performance on both downstream tasks (with 96.4% for code information extraction task, and 57.12% for code search task).
    Safe Deep RL in 3D Environments using Human Feedback. (arXiv:2201.08102v1 [cs.LG])
    (2 min) Agents should avoid unsafe behaviour during both training and deployment. This typically requires a simulator and a procedural specification of unsafe behaviour. Unfortunately, a simulator is not always available, and procedurally specifying constraints can be difficult or impossible for many real-world tasks. A recently introduced technique, ReQueST, aims to solve this problem by learning a neural simulator of the environment from safe human trajectories, then using the learned simulator to efficiently learn a reward model from human feedback. However, it is yet unknown whether this approach is feasible in complex 3D environments with feedback obtained from real humans - whether sufficient pixel-based neural simulator quality can be achieved, and whether the human data requirements are viable in terms of both quantity and quality. In this paper we answer this question in the affirmative, using ReQueST to train an agent to perform a 3D first-person object collection task using data entirely from human contractors. We show that the resulting agent exhibits an order of magnitude reduction in unsafe behaviour compared to standard reinforcement learning.
    Recursive Constraints to Prevent Instability in Constrained Reinforcement Learning. (arXiv:2201.07958v1 [cs.LG])
    (2 min) We consider the challenge of finding a deterministic policy for a Markov decision process that uniformly (in all states) maximizes one reward subject to a probabilistic constraint over a different reward. Existing solutions do not fully address our precise problem definition, which nevertheless arises naturally in the context of safety-critical robotic systems. This class of problem is known to be hard, but the combined requirements of determinism and uniform optimality can create learning instability. In this work, after describing and motivating our problem with a simple example, we present a suitable constrained reinforcement learning algorithm that prevents learning instability, using recursive constraints. Our proposed approach admits an approximative form that improves efficiency and is conservative w.r.t. the constraint.
    Unsupervised Graph Poisoning Attack via Contrastive Loss Back-propagation. (arXiv:2201.07986v1 [cs.LG])
    (2 min) Graph contrastive learning is the state-of-the-art unsupervised graph representation learning framework and has shown comparable performance with supervised approaches. However, evaluating whether the graph contrastive learning is robust to adversarial attacks is still an open problem because most existing graph adversarial attacks are supervised models, which means they heavily rely on labels and can only be used to evaluate the graph contrastive learning in a specific scenario. For unsupervised graph representation methods such as graph contrastive learning, it is difficult to acquire labels in real-world scenarios, making traditional supervised graph attack methods difficult to be applied to test their robustness. In this paper, we propose a novel unsupervised gradient-based adversarial attack that does not rely on labels for graph contrastive learning. We compute the gradients of the adjacency matrices of the two views and flip the edges with gradient ascent to maximize the contrastive loss. In this way, we can fully use multiple views generated by the graph contrastive learning models and pick the most informative edges without knowing their labels, and therefore can promisingly support our model adapted to more kinds of downstream tasks. Extensive experiments show that our attack outperforms unsupervised baseline attacks and has comparable performance with supervised attacks in multiple downstream tasks including node classification and link prediction. We further show that our attack can be transferred to other graph representation models as well.
  • stat.ML updates on arXiv.org

    Technologies for Trustworthy Machine Learning: A Survey in a Socio-Technical Context. (arXiv:2007.08911v3 [cs.LG] UPDATED)
    (2 min) Concerns about the societal impact of AI-based services and systems has encouraged governments and other organisations around the world to propose AI policy frameworks to address fairness, accountability, transparency and related topics. To achieve the objectives of these frameworks, the data and software engineers who build machine-learning systems require knowledge about a variety of relevant supporting tools and techniques. In this paper we provide an overview of technologies that support building trustworthy machine learning systems, i.e., systems whose properties justify that people place trust in them. We argue that four categories of system properties are instrumental in achieving the policy objectives, namely fairness, explainability, auditability and safety & security (FEAS). We discuss how these properties need to be considered across all stages of the machine learning life cycle, from data collection through run-time model inference. As a consequence, we survey in this paper the main technologies with respect to all four of the FEAS properties, for data-centric as well as model-centric stages of the machine learning system life cycle. We conclude with an identification of open research problems, with a particular focus on the connection between trustworthy machine learning technologies and their implications for individuals and society.
    Node Feature Kernels Increase Graph Convolutional Network Robustness. (arXiv:2109.01785v2 [cs.LG] UPDATED)
    (2 min) The robustness of the much-used Graph Convolutional Networks (GCNs) to perturbations of their input is becoming a topic of increasing importance. In this paper, the random GCN is introduced for which a random matrix theory analysis is possible. This analysis suggests that if the graph is sufficiently perturbed, or in the extreme case random, then the GCN fails to benefit from the node features. It is furthermore observed that enhancing the message passing step in GCNs by adding the node feature kernel to the adjacency matrix of the graph structure solves this problem. An empirical study of a GCN utilised for node classification on six real datasets further confirms the theoretical findings and demonstrates that perturbations of the graph structure can result in GCNs performing significantly worse than Multi-Layer Perceptrons run on the node features alone. In practice, adding a node feature kernel to the message passing of perturbed graphs results in a significant improvement of the GCN's performance, thereby rendering it more robust to graph perturbations. Our code is publicly available at:https://github.com/ChangminWu/RobustGCN.
    A Framework for the Meta-Analysis of Randomized Experiments with Applications to Heavy-Tailed Response Data. (arXiv:2112.07602v3 [stat.ME] UPDATED)
    (2 min) A central obstacle in the objective assessment of treatment effect (TE) estimators in randomized control trials (RCTs) is the lack of ground truth (or validation set) to test their performance. In this paper, we provide a novel cross-validation-like methodology to address this challenge. The key insight of our procedure is that the noisy (but unbiased) difference-of-means estimate can be used as a ground truth "label" on a portion of the RCT, to test the performance of an estimator trained on the other portion. We combine this insight with an aggregation scheme, which borrows statistical strength across a large collection of RCTs, to present an end-to-end methodology for judging an estimator's ability to recover the underlying treatment effect. We evaluate our methodology across 709 RCTs implemented in the Amazon supply chain. In the corpus of AB tests at Amazon, we highlight the unique difficulties associated with recovering the treatment effect due to the heavy-tailed nature of the response variables. In this heavy-tailed setting, our methodology suggests that procedures that aggressively downweight or truncate large values, while introducing bias, lower the variance enough to ensure that the treatment effect is more accurately estimated.
    Quantum Discriminator for Binary Classification. (arXiv:2009.01235v2 [quant-ph] UPDATED)
    (2 min) Quantum computers have the unique ability to operate relatively quickly in high-dimensional spaces -- this is sought to give them a competitive advantage over classical computers. In this work, we propose a novel quantum machine learning model called the Quantum Discriminator, which leverages the ability of quantum computers to operate in the high-dimensional spaces. The quantum discriminator is trained using a quantum-classical hybrid algorithm in O(N logN) time, and inferencing is performed on a universal quantum computer in linear time. The quantum discriminator takes as input the binary features extracted from a given datum along with a prediction qubit initialized to the zero state and outputs the predicted label. We analyze its performance on the Iris data set and show that the quantum discriminator can attain 99% accuracy in simulation.
    Kernel Methods and Multi-layer Perceptrons Learn Linear Models in High Dimensions. (arXiv:2201.08082v1 [stat.ML])
    (2 min) Empirical observation of high dimensional phenomena, such as the double descent behaviour, has attracted a lot of interest in understanding classical techniques such as kernel methods, and their implications to explain generalization properties of neural networks. Many recent works analyze such models in a certain high-dimensional regime where the covariates are independent and the number of samples and the number of covariates grow at a fixed ratio (i.e. proportional asymptotics). In this work we show that for a large class of kernels, including the neural tangent kernel of fully connected networks, kernel methods can only perform as well as linear models in this regime. More surprisingly, when the data is generated by a kernel model where the relationship between input and the response could be very nonlinear, we show that linear models are in fact optimal, i.e. linear models achieve the minimum risk among all models, linear or nonlinear. These results suggest that more complex models for the data other than independent features are needed for high-dimensional analysis.
    Multi-agent Natural Actor-critic Reinforcement Learning Algorithms. (arXiv:2109.01654v2 [cs.LG] UPDATED)
    (3 min) Multi-agent actor-critic algorithms are an important part of the Reinforcement Learning paradigm. We propose three fully decentralized multi-agent natural actor-critic (MAN) algorithms in this work. The objective is to collectively find a joint policy that maximizes the average long-term return of these agents. In the absence of a central controller and to preserve privacy, agents communicate some information to their neighbors via a time-varying communication network. We prove convergence of all the 3 MAN algorithms to a globally asymptotically stable set of the ODE corresponding to actor update; these use linear function approximations. We show that the Kullback-Leibler divergence between policies of successive iterates is proportional to the objective function's gradient. We observe that the minimum singular value of the Fisher information matrix is well within the reciprocal of the policy parameter dimension. Using this, we theoretically show that the optimal value of the deterministic variant of the MAN algorithm at each iterate dominates that of the standard gradient-based multi-agent actor-critic (MAAC) algorithm. To our knowledge, it is a first such result in multi-agent reinforcement learning (MARL). To illustrate the usefulness of our proposed algorithms, we implement them on a bi-lane traffic network to reduce the average network congestion. We observe an almost 25\% reduction in the average congestion in 2 MAN algorithms; the average congestion in another MAN algorithm is on par with the MAAC algorithm. We also consider a generic $15$ agent MARL; the performance of the MAN algorithms is again as good as the MAAC algorithm.
    Data Augmentation for Bayesian Deep Learning. (arXiv:1903.09668v2 [stat.ML] UPDATED)
    (2 min) Deep Learning (DL) methods have emerged as one of the most powerful tools for functional approximation and prediction. While the representation properties of DL have been well studied, uncertainty quantification remains challenging and largely unexplored. Data augmentation techniques are a natural approach to provide uncertainty quantification and to integrate stochastic MCMC search with stochastic gradient descent (SGD) methods. The purpose of our paper is to show that training DL architectures with data augmentation leads to efficiency gains. To demonstrate our methodology, we develop data augmentation algorithms for a variety of commonly used activation functions: logit, ReLU and SVM. Our methodology is compared with traditional stochastic gradient descent with back-propagation. Our optimization procedure leads to a version of iteratively re-weighted least squares and can be implemented at scale with accelerated linear algebra methods providing substantial performance improvement. We illustrate our methodology on a number of standard datasets. Finally, we conclude with directions for future research.
    Analyzing Data Selection Techniques with Tools from the Theory of Information Losses. (arXiv:1902.09602v4 [cs.LG] UPDATED)
    (2 min) In this paper, we present and illustrate some new tools for rigorously analyzing training data selection methods. These tools focus on the information theoretic losses that occur when sampling data. We use this framework to prove that two methods, Facility Location Selection and Transductive Experimental Design, reduce these losses. These are meant to act as generalizable theoretical examples of applying the field of Information Theoretic Deep Learning Theory to the fields of data selection and active learning. Both analyses yield insight into their respective methods and increase their interpretability. In the case of Transductive Experimental Design, the provided analysis greatly increases the method's scope as well.
    Lead-lag detection and network clustering for multivariate time series with an application to the US equity market. (arXiv:2201.08283v1 [stat.ML])
    (2 min) In multivariate time series systems, it has been observed that certain groups of variables partially lead the evolution of the system, while other variables follow this evolution with a time delay; the result is a lead-lag structure amongst the time series variables. In this paper, we propose a method for the detection of lead-lag clusters of time series in multivariate systems. We demonstrate that the web of pairwise lead-lag relationships between time series can be helpfully construed as a directed network, for which there exist suitable algorithms for the detection of pairs of lead-lag clusters with high pairwise imbalance. Within our framework, we consider a number of choices for the pairwise lead-lag metric and directed network clustering components. Our framework is validated on both a synthetic generative model for multivariate lead-lag time series systems and daily real-world US equity prices data. We showcase that our method is able to detect statistically significant lead-lag clusters in the US equity market. We study the nature of these clusters in the context of the empirical finance literature on lead-lag relations and demonstrate how these can be used for the construction of predictive financial signals.
    Learning with latent group sparsity via heat flow dynamics on networks. (arXiv:2201.08326v1 [stat.ME])
    (2 min) Group or cluster structure on explanatory variables in machine learning problems is a very general phenomenon, which has attracted broad interest from practitioners and theoreticians alike. In this work we contribute an approach to learning under such group structure, that does not require prior information on the group identities. Our paradigm is motivated by the Laplacian geometry of an underlying network with a related community structure, and proceeds by directly incorporating this into a penalty that is effectively computed via a heat flow-based local network dynamics. In fact, we demonstrate a procedure to construct such a network based on the available data. Notably, we dispense with computationally intensive pre-processing involving clustering of variables, spectral or otherwise. Our technique is underpinned by rigorous theorems that guarantee its effective performance and provide bounds on its sample complexity. In particular, in a wide range of settings, it provably suffices to run the heat flow dynamics for time that is only logarithmic in the problem dimensions. We explore in detail the interfaces of our approach with key statistical physics models in network science, such as the Gaussian Free Field and the Stochastic Block Model. We validate our approach by successful applications to real-world data from a wide array of application domains, including computer science, genetics, climatology and economics. Our work raises the possibility of applying similar diffusion-based techniques to classical learning tasks, exploiting the interplay between geometric, dynamical and stochastic structures underlying the data.
    Accelerated Gradient Flow: Risk, Stability, and Implicit Regularization. (arXiv:2201.08311v1 [stat.ML])
    (2 min) Acceleration and momentum are the de facto standard in modern applications of machine learning and optimization, yet the bulk of the work on implicit regularization focuses instead on unaccelerated methods. In this paper, we study the statistical risk of the iterates generated by Nesterov's accelerated gradient method and Polyak's heavy ball method, when applied to least squares regression, drawing several connections to explicit penalization. We carry out our analyses in continuous-time, allowing us to make sharper statements than in prior work, and revealing complex interactions between early stopping, stability, and the curvature of the loss function.
    Statistical Depth Functions for Ranking Distributions: Definitions, Statistical Learning and Applications. (arXiv:2201.08105v1 [cs.LG])
    (2 min) The concept of median/consensus has been widely investigated in order to provide a statistical summary of ranking data, i.e. realizations of a random permutation $\Sigma$ of a finite set, $\{1,\; \ldots,\; n\}$ with $n\geq 1$ say. As it sheds light onto only one aspect of $\Sigma$'s distribution $P$, it may neglect other informative features. It is the purpose of this paper to define analogs of quantiles, ranks and statistical procedures based on such quantities for the analysis of ranking data by means of a metric-based notion of depth function on the symmetric group. Overcoming the absence of vector space structure on $\mathfrak{S}_n$, the latter defines a center-outward ordering of the permutations in the support of $P$ and extends the classic metric-based formulation of consensus ranking (medians corresponding then to the deepest permutations). The axiomatic properties that ranking depths should ideally possess are listed, while computational and generalization issues are studied at length. Beyond the theoretical analysis carried out, the relevance of the novel concepts and methods introduced for a wide variety of statistical tasks are also supported by numerous numerical experiments.
    Communication-Efficient Device Scheduling for Federated Learning Using Stochastic Optimization. (arXiv:2201.07912v1 [cs.LG])
    (2 min) Federated learning (FL) is a useful tool in distributed machine learning that utilizes users' local datasets in a privacy-preserving manner. When deploying FL in a constrained wireless environment; however, training models in a time-efficient manner can be a challenging task due to intermittent connectivity of devices, heterogeneous connection quality, and non-i.i.d. data. In this paper, we provide a novel convergence analysis of non-convex loss functions using FL on both i.i.d. and non-i.i.d. datasets with arbitrary device selection probabilities for each round. Then, using the derived convergence bound, we use stochastic optimization to develop a new client selection and power allocation algorithm that minimizes a function of the convergence bound and the average communication time under a transmit power constraint. We find an analytical solution to the minimization problem. One key feature of the algorithm is that knowledge of the channel statistics is not required and only the instantaneous channel state information needs to be known. Using the FEMNIST and CIFAR-10 datasets, we show through simulations that the communication time can be significantly decreased using our algorithm, compared to uniformly random participation.
    Probabilistic Simplex Component Analysis. (arXiv:2103.10027v2 [eess.SP] UPDATED)
    (2 min) This study presents PRISM, a probabilistic simplex component analysis approach to identifying the vertices of a data-circumscribing simplex from data. The problem has a rich variety of applications, the most notable being hyperspectral unmixing in remote sensing and non-negative matrix factorization in machine learning. PRISM uses a simple probabilistic model, namely, uniform simplex data distribution and additive Gaussian noise, and it carries out inference by maximum likelihood. The inference model is sound in the sense that the vertices are provably identifiable under some assumptions, and it suggests that PRISM can be effective in combating noise when the number of data points is large. PRISM has strong, but hidden, relationships with simplex volume minimization, a powerful geometric approach for the same problem. We study these fundamental aspects, and we also consider algorithmic schemes based on importance sampling and variational inference. In particular, the variational inference scheme is shown to resemble a matrix factorization problem with a special regularizer, which draws an interesting connection to the matrix factorization approach. Numerical results are provided to demonstrate the potential of PRISM.
    Spectral embedding for dynamic networks with stability guarantees. (arXiv:2106.01282v2 [stat.ML] UPDATED)
    (2 min) We consider the problem of embedding a dynamic network, to obtain time-evolving vector representations of each node, which can then be used to describe changes in behaviour of individual nodes, communities, or the entire graph. Given this open-ended remit, we argue that two types of stability in the spatio-temporal positioning of nodes are desirable: to assign the same position, up to noise, to nodes behaving similarly at a given time (cross-sectional stability) and a constant position, up to noise, to a single node behaving similarly across different times (longitudinal stability). Similarity in behaviour is defined formally using notions of exchangeability under a dynamic latent position network model. By showing how this model can be recast as a multilayer random dot product graph, we demonstrate that unfolded adjacency spectral embedding satisfies both stability conditions. We also show how two alternative methods, omnibus and independent spectral embedding, alternately lack one or the other form of stability.
    Speeding up Learning Quantum States through Group Equivariant Convolutional Quantum Ans\"atze. (arXiv:2112.07611v2 [quant-ph] UPDATED)
    (2 min) We develop a theoretical framework for $S_n$-equivariant quantum convolutional circuits, building on and significantly generalizing Jordan's Permutational Quantum Computing (PQC) formalism. We show that quantum circuits are a natural choice for Fourier space neural architectures affording a super-exponential speedup in computing the matrix elements of $S_n$-Fourier coefficients compared to the best known classical Fast Fourier Transform (FFT) over the symmetric group. In particular, we utilize the Okounkov-Vershik approach to prove Harrow's statement (Ph.D. Thesis 2005 p.160) on the equivalence between $\operatorname{SU}(d)$- and $S_n$-irrep bases and to establish the $S_n$-equivariant Convolutional Quantum Alternating Ans\"atze ($S_n$-CQA) using Young-Jucys-Murphy (YJM) elements. We prove that $S_n$-CQA are dense, thus expressible within each $S_n$-irrep block, which may serve as a universal model for potential future quantum machine learning and optimization applications. Our method provides another way to prove the universality of Quantum Approximate Optimization Algorithm (QAOA), from the representation-theoretical point of view. Our framework can be naturally applied to a wide array of problems with global $\operatorname{SU}(d)$ symmetry. We present numerical simulations to showcase the effectiveness of the ans\"atze to find the sign structure of the ground state of the $J_1$-$J_2$ antiferromagnetic Heisenberg model on the rectangular and Kagome lattices. Our work identifies quantum advantage for a specific machine learning problem, and provides the first application of the celebrated Okounkov-Vershik's representation theory to machine learning and quantum physics.
    Dimensionality reduction to maximize prediction generalization capability. (arXiv:2003.00470v2 [stat.ML] UPDATED)
    (2 min) Generalization of time series prediction remains an important open issue in machine learning, wherein earlier methods have either large generalization error or local minima. We develop an analytically solvable, unsupervised learning scheme that extracts the most informative components for predicting future inputs, termed predictive principal component analysis (PredPCA). Our scheme can effectively remove unpredictable noise and minimize test prediction error through convex optimization. Mathematical analyses demonstrate that, provided with sufficient training samples and sufficiently high-dimensional observations, PredPCA can asymptotically identify hidden states, system parameters, and dimensionalities of canonical nonlinear generative processes, with a global convergence guarantee. We demonstrate the performance of PredPCA using sequential visual inputs comprising hand-digits, rotating 3D objects, and natural scenes. It reliably estimates distinct hidden states and predicts future outcomes of previously unseen test input data, based exclusively on noisy observations. The simple architecture and low computational cost of PredPCA are highly desirable for neuromorphic hardware.
    Using Machine Learning to Test Causal Hypotheses in Conjoint Analysis. (arXiv:2201.08343v1 [stat.ME])
    (2 min) Conjoint analysis is a popular experimental design used to measure multidimensional preferences. Researchers examine how varying a factor of interest, while controlling for other relevant factors, influences decision-making. Currently, there exist two methodological approaches to analyzing data from a conjoint experiment. The first focuses on estimating the average marginal effects of each factor while averaging over the other factors. Although this allows for straightforward design-based estimation, the results critically depend on the distribution of other factors and how interaction effects are aggregated. An alternative model-based approach can compute various quantities of interest, but requires researchers to correctly specify the model, a challenging task for conjoint analysis with many factors and possible interactions. In addition, a commonly used logistic regression has poor statistical properties even with a moderate number of factors when incorporating interactions. We propose a new hypothesis testing approach based on the conditional randomization test to answer the most fundamental question of conjoint analysis: Does a factor of interest matter in any way given the other factors? Our methodology is solely based on the randomization of factors, and hence is free from assumptions. Yet, it allows researchers to use any test statistic, including those based on complex machine learning algorithms. As a result, we are able to combine the strengths of the existing design-based and model-based approaches. We illustrate the proposed methodology through conjoint analysis of immigration preferences and political candidate evaluation. We also extend the proposed approach to test for regularity assumptions commonly used in conjoint analysis.
    Multiway Spherical Clustering via Degree-Corrected Tensor Block Models. (arXiv:2201.07401v1 [math.ST] CROSS LISTED)
    (2 min) We consider the problem of multiway clustering in the presence of unknown degree heterogeneity. Such data problems arise commonly in applications such as recommendation system, neuroimaging, community detection, and hypergraph partitions in social networks. The allowance of degree heterogeneity provides great flexibility in clustering models, but the extra complexity poses significant challenges in both statistics and computation. Here, we develop a degree-corrected tensor block model with estimation accuracy guarantees. We present the phase transition of clustering performance based on the notion of angle separability, and we characterize three signal-to-noise regimes corresponding to different statistical-computational behaviors. In particular, we demonstrate that an intrinsic statistical-to-computational gap emerges only for tensors of order three or greater. Further, we develop an efficient polynomial-time algorithm that provably achieves exact clustering under mild signal conditions. The efficacy of our procedure is demonstrated through two data applications, one on human brain connectome project, and another on Peru Legislation network dataset.
    BINAS: Bilinear Interpretable Neural Architecture Search. (arXiv:2110.12399v2 [cs.LG] UPDATED)
    (2 min) Practical use of neural networks often involves requirements on latency, energy and memory among others. A popular approach to find networks under such requirements is through constrained Neural Architecture Search (NAS). However, previous methods use complicated predictors for the accuracy of the network. Those predictors are hard to interpret and sensitive to many hyperparameters to be tuned, hence, the resulting accuracy of the generated models is often harmed. In this work we resolve this by introducing Bilinear Interpretable Neural Architecture Search (BINAS), that is based on an accurate and simple bilinear formulation of both an accuracy estimator and the expected resource requirement, together with a scalable search method with theoretical guarantees. The simplicity of our proposed estimator together with the intuitive way it is constructed bring interpretability through many insights about the contribution of different design choices. For example, we find that in the examined search space, adding depth and width is more effective at deeper stages of the network and at the beginning of each resolution stage. Our experiments show that BINAS generates comparable to or better architectures than other state-of-the-art NAS methods within a reduced marginal search cost, while strictly satisfying the resource constraints.
    Heavy-tailed Sampling via Transformed Unadjusted Langevin Algorithm. (arXiv:2201.08349v1 [math.ST])
    (2 min) We analyze the oracle complexity of sampling from polynomially decaying heavy-tailed target densities based on running the Unadjusted Langevin Algorithm on certain transformed versions of the target density. The specific class of closed-form transformation maps that we construct are shown to be diffeomorphisms, and are particularly suited for developing efficient diffusion-based samplers. We characterize the precise class of heavy-tailed densities for which polynomial-order oracle complexities (in dimension and inverse target accuracy) could be obtained, and provide illustrative examples. We highlight the relationship between our assumptions and functional inequalities (super and weak Poincar\'e inequalities) based on non-local Dirichlet forms defined via fractional Laplacian operators, used to characterize the heavy-tailed equilibrium densities of certain stable-driven stochastic differential equations.
    Priors, Hierarchy, and Information Asymmetry for Skill Transfer in Reinforcement Learning. (arXiv:2201.08115v1 [cs.AI])
    (2 min) The ability to discover behaviours from past experience and transfer them to new tasks is a hallmark of intelligent agents acting sample-efficiently in the real world. Equipping embodied reinforcement learners with the same ability may be crucial for their successful deployment in robotics. While hierarchical and KL-regularized RL individually hold promise here, arguably a hybrid approach could combine their respective benefits. Key to these fields is the use of information asymmetry to bias which skills are learnt. While asymmetric choice has a large influence on transferability, prior works have explored a narrow range of asymmetries, primarily motivated by intuition. In this paper, we theoretically and empirically show the crucial trade-off, controlled by information asymmetry, between the expressivity and transferability of skills across sequential tasks. Given this insight, we provide a principled approach towards choosing asymmetry and apply our approach to a complex, robotic block stacking domain, unsolvable by baselines, demonstrating the effectiveness of hierarchical KL-regularized RL, coupled with correct asymmetric choice, for sample-efficient transfer learning.
    Predictive Inference with Weak Supervision. (arXiv:2201.08315v1 [stat.ML])
    (2 min) The expense of acquiring labels in large-scale statistical machine learning makes partially and weakly-labeled data attractive, though it is not always apparent how to leverage such data for model fitting or validation. We present a methodology to bridge the gap between partial supervision and validation, developing a conformal prediction framework to provide valid predictive confidence sets -- sets that cover a true label with a prescribed probability, independent of the underlying distribution -- using weakly labeled data. To do so, we introduce a (necessary) new notion of coverage and predictive validity, then develop several application scenarios, providing efficient algorithms for classification and several large-scale structured prediction problems. We corroborate the hypothesis that the new coverage definition allows for tighter and more informative (but valid) confidence sets through several experiments.
    NeurWIN: Neural Whittle Index Network For Restless Bandits Via Deep RL. (arXiv:2110.02128v2 [cs.LG] UPDATED)
    (2 min) Whittle index policy is a powerful tool to obtain asymptotically optimal solutions for the notoriously intractable problem of restless bandits. However, finding the Whittle indices remains a difficult problem for many practical restless bandits with convoluted transition kernels. This paper proposes NeurWIN, a neural Whittle index network that seeks to learn the Whittle indices for any restless bandits by leveraging mathematical properties of the Whittle indices. We show that a neural network that produces the Whittle index is also one that produces the optimal control for a set of Markov decision problems. This property motivates using deep reinforcement learning for the training of NeurWIN. We demonstrate the utility of NeurWIN by evaluating its performance for three recently studied restless bandit problems. Our experiment results show that the performance of NeurWIN is significantly better than other RL algorithms.
    Factor-augmented tree ensembles. (arXiv:2111.14000v2 [stat.ML] UPDATED)
    (2 min) This manuscript proposes to extend the information set of time-series regression trees with latent stationary factors extracted via state-space methods. In doing so, this approach generalises time-series regression trees on two dimensions. First, it allows to handle predictors that exhibit measurement error, non-stationary trends, seasonality and/or irregularities such as missing observations. Second, it gives a transparent way for using domain-specific theory to inform time-series regression trees. As a byproduct, this technique sets the foundations for structuring powerful ensembles. Their real-world applicability is studied under the lenses of empirical macro-finance.
    SSSNET: Semi-Supervised Signed Network Clustering. (arXiv:2110.06623v2 [cs.SI] UPDATED)
    (2 min) Node embeddings are a powerful tool in the analysis of networks; yet, their full potential for the important task of node clustering has not been fully exploited. In particular, most state-of-the-art methods generating node embeddings of signed networks focus on link sign prediction, and those that pertain to node clustering are usually not graph neural network (GNN) methods. Here, we introduce a novel probabilistic balanced normalized cut loss for training nodes in a GNN framework for semi-supervised signed network clustering, called SSSNET. The method is end-to-end in combining embedding generation and clustering without an intermediate step; it has node clustering as main focus, with an emphasis on polarization effects arising in networks. The main novelty of our approach is a new take on the role of social balance theory for signed network embeddings. The standard heuristic for justifying the criteria for the embeddings hinges on the assumption that "an enemy's enemy is a friend". Here, instead, a neutral stance is assumed on whether or not the enemy of an enemy is a friend. Experimental results on various data sets, including a synthetic signed stochastic block model, a polarized version of it, and real-world data at different scales, demonstrate that SSSNET can achieve comparable or better results than state-of-the-art spectral clustering methods, for a wide range of noise and sparsity levels. SSSNET complements existing methods through the possibility of including exogenous information, in the form of node-level features or labels.
    Statistical Learning for Individualized Asset Allocation. (arXiv:2201.07998v1 [stat.ML])
    (2 min) We establish a high-dimensional statistical learning framework for individualized asset allocation. Our proposed methodology addresses continuous-action decision-making with a large number of characteristics. We develop a discretization approach to model the effect from continuous actions and allow the discretization level to be large and diverge with the number of observations. The value function of continuous-action is estimated using penalized regression with generalized penalties that are imposed on linear transformations of the model coefficients. We show that our estimators using generalized folded concave penalties enjoy desirable theoretical properties and allow for statistical inference of the optimal value associated with optimal decision-making. Empirically, the proposed framework is exercised with the Health and Retirement Study data in finding individualized optimal asset allocation. The results show that our individualized optimal strategy improves individual financial well-being and surpasses benchmark strategies.
    Exact Statistical Inference for the Wasserstein Distance by Selective Inference. (arXiv:2109.14206v3 [stat.ML] UPDATED)
    (2 min) In this paper, we study statistical inference for the Wasserstein distance, which has attracted much attention and has been applied to various machine learning tasks. Several studies have been proposed in the literature, but almost all of them are based on asymptotic approximation and do not have finite-sample validity. In this study, we propose an exact (non-asymptotic) inference method for the Wasserstein distance inspired by the concept of conditional Selective Inference (SI). To our knowledge, this is the first method that can provide a valid confidence interval (CI) for the Wasserstein distance with finite-sample coverage guarantee, which can be applied not only to one-dimensional problems but also to multi-dimensional problems. We evaluate the performance of the proposed method on both synthetic and real-world datasets.
    Generalizing Off-Policy Evaluation From a Causal Perspective For Sequential Decision-Making. (arXiv:2201.08262v1 [cs.LG])
    (2 min) Assessing the effects of a policy based on observational data from a different policy is a common problem across several high-stake decision-making domains, and several off-policy evaluation (OPE) techniques have been proposed. However, these methods largely formulate OPE as a problem disassociated from the process used to generate the data (i.e. structural assumptions in the form of a causal graph). We argue that explicitly highlighting this association has important implications on our understanding of the fundamental limits of OPE. First, this implies that current formulation of OPE corresponds to a narrow set of tasks, i.e. a specific causal estimand which is focused on prospective evaluation of policies over populations or sub-populations. Second, we demonstrate how this association motivates natural desiderata to consider a general set of causal estimands, particularly extending the role of OPE for counterfactual off-policy evaluation at the level of individuals of the population. A precise description of the causal estimand highlights which OPE estimands are identifiable from observational data under the stated generative assumptions. For those OPE estimands that are not identifiable, the causal perspective further highlights where more experimental data is necessary, and highlights situations where human expertise can aid identification and estimation. Furthermore, many formalisms of OPE overlook the role of uncertainty entirely in the estimation process.We demonstrate how specifically characterising the causal estimand highlights the different sources of uncertainty and when human expertise can naturally manage this uncertainty. We discuss each of these aspects as actionable desiderata for future OPE research at scale and in-line with practical utility.
    Sketch-and-Lift: Scalable Subsampled Semidefinite Program for $K$-means Clustering. (arXiv:2201.08226v1 [stat.ML])
    (2 min) Semidefinite programming (SDP) is a powerful tool for tackling a wide range of computationally hard problems such as clustering. Despite the high accuracy, semidefinite programs are often too slow in practice with poor scalability on large (or even moderate) datasets. In this paper, we introduce a linear time complexity algorithm for approximating an SDP relaxed $K$-means clustering. The proposed sketch-and-lift (SL) approach solves an SDP on a subsampled dataset and then propagates the solution to all data points by a nearest-centroid rounding procedure. It is shown that the SL approach enjoys a similar exact recovery threshold as the $K$-means SDP on the full dataset, which is known to be information-theoretically tight under the Gaussian mixture model. The SL method can be made adaptive with enhanced theoretic properties when the cluster sizes are unbalanced. Our simulation experiments demonstrate that the statistical accuracy of the proposed method outperforms state-of-the-art fast clustering algorithms without sacrificing too much computational efficiency, and is comparable to the original $K$-means SDP with substantially reduced runtime.
    Practical Transfer Learning for Bayesian Optimization. (arXiv:1802.02219v3 [stat.ML] UPDATED)
    (2 min) When hyperparameter optimization of a machine learning algorithm is repeated for multiple datasets it is possible to transfer knowledge to an optimization run on a new dataset. We develop a new hyperparameter-free ensemble model for Bayesian optimization that is a generalization of two existing transfer learning extensions to Bayesian optimization and establish a worst-case bound compared to vanilla Bayesian optimization. Using a large collection of hyperparameter optimization benchmark problems, we demonstrate that our contributions substantially reduce optimization time compared to standard Gaussian process-based Bayesian optimization and improve over the current state-of-the-art for transfer hyperparameter optimization.
  • Data Science Central

    5 Alternatives to Search Engine Optimization
    (4 min) It is not a coincidence that search engine optimization is the ‘holy cow’ of internet traffic. It is responsible for more than half of it. Every second, Google alone processes nearly 100,000 search queries. Therefore, content creators do their best to exploit SEO tricks and gimmicks to their advantage and push their websites to the… Read More »5 Alternatives to Search Engine Optimization The post 5 Alternatives to Search Engine Optimization appeared first on Data Science Central.
    The Shape of Data
    (5 min) In the last few years, a very interesting concept has emerged from the realm of semantics: data shapes. This idea has been touched upon elsewhere – constraint modeling with Schematron in XML (much of which was later absorbed in the XSD 1.1 specification) and even to a certain extent in OWL. However, shapes have for… Read More »The Shape of Data The post The Shape of Data appeared first on Data Science Central.
  • John D. Cook

    Modal axioms and rules for interplanetary travel
    (3 min) The previous post pointed out the analogy between models for modal logic (i.e. Kripke semantics) and science fiction. Rules for relationships between points in a Kripke model are analogous to rules for interplanetary travel in a fictional universe. This post will expand on this last point. The following cube shows the relationships between eight different […] Modal axioms and rules for interplanetary travel first appeared on John D. Cook.
  • Artificial Intelligence

    AI Girlfriend App Predictions
    (2 min) TL;DR: AI Girlfriend App is coming. Whoever gets it right first will make more money than you can imagine. I know replica exists, but the fact that it hasn't gone mainstream speaks for itself. It's interesting, but not outlandishly amazing. Let us assume that these two points are true. An app that offers an experience roughly comparable to the OS from the movie "Her" would sell millions and millions of copies, even as an expensive subscription. (Let's say $20 per month). People are lonely. Transformers are able to generate compelling conversations. GPT-3 still hallucinates a bit too much, but GPT-4 and competing future networks will be able to do this consistently. Then the only conclusion is that while such an app will be incredibly expensive to maintain, it will also make more m…
    Artificial Nightmares : Wrath Of The Lich King || Pytti VQGAN AI Art Video [4K 60 FPS]
    (0 min) submitted by /u/Thenamessd [link] [comments]
    The text and the matrix video is polygonized with 300 polygons that evolve, frame by frame, by using an Evolutionary Algorithm in order to match the scene. Hope you like it!
    (1 min) submitted by /u/ValianTek_World [link] [comments]
    Animate Your Pictures Realistically With AI !
    (0 min) submitted by /u/OnlyProggingForFun [link] [comments]
    i made a fight of two ai in hedgewars
    (0 min) submitted by /u/antekgort200 [link] [comments]
    Is Artbreeder really the best AI photo editing software?
    (1 min) I've been discussing this with a friend for half a month and he says it's the best AI when it comes to editing photos easily. Is it really true? I do not mean something like photoshop, I mean a AI which can make you laugh/ open your eyes nearly automaticaly submitted by /u/xXLisa28Xx [link] [comments]
    Artificial Nightmares : Arcane Jinx || Pytti VQGAN AI Art Video [4K 60 FPS]
    (0 min) submitted by /u/Thenamessd [link] [comments]
    Deal: IBM To Sell Watson Health Assets To PE Firm Francisco Partners
    (0 min) submitted by /u/The-Techie [link] [comments]
    Artificial Nightmares : Cthulhu Is N'zoth || Pytti VQGAN AI Art Video [4K 60 FPS]
    (1 min) submitted by /u/Thenamessd [link] [comments]
  • Machine Learning

    What are your favourite Machine Learning topics that you think haven't been researched thoroughly enough? [D]
    (1 min) Hi All, I am an undergrad student looking for PhD research ideas. I have been guided by my supervisor to research the effects of randomisation in Machine Learning but I do not know where to start. I have looked at confounding variables and that some randomisation techniques can mitigate the effects of the confounding variables on the results. My background is in Electronic Engineering, but I am wanting to veer more on the software engineering side of things. Are there any research topics that you find interesting in ML (don't have to be around randomisation) that you think need more research? submitted by /u/dumb_persn [link] [comments]
    [P] Documentation generated using AI
    (2 min) submitted by /u/infinitlybana [link] [comments]
    [Discussion] Can GANs be used to measure simulators fidelity?
    (1 min) The Dataset will be real-world sequential process data. The Generator will output the same process within a Simulator. submitted by /u/Altruistic_Case6397 [link] [comments]
    [D] Interview w Francois Chollet: Learning new topics, Research, JAX & Keras API
    (1 min) Hi Everybody! I have been hosting a reading group for the past few weeks around Francois' latest book. The reading group would not fit in this subreddit hence I didn't post about it here. However, we had the privilege to start the reading series with an interview with Francois himself, we spoke about approaching research, the Keras API, JAX and beyond. I realised it might be of interest to people in this community, hence I'm sharing it now :) I hope you'll enjoy the discussion/interview as much as I did! ​ Link to recording: https://www.youtube.com/watch?v=1MGDzUDMV3g submitted by /u/init__27 [link] [comments]
    [R] Hierarchical Kinematic Probability Distributions for 3D Human Shape and Pose Estimation from Images in the Wild
    (1 min) Github Paper Results Video Monocular 3D human pose and shape estimation is an ill-posed problem, wherein multiple 3D bodies may project to the same 2D image - particularly when dealing with occlusions. This work attempts to predict a distribution over 3D body shape and pose conditioned on an input image, instead of a single solution. Multiple plausible 3D bodies can be sampled from the predicted distributions. Aleatoric uncertainty (e.g. due to occlusions or depth ambiguity) may be estimated as the variance of the predicted distributions. submitted by /u/synonymous1964 [link] [comments]
    [R] Active Predictive Coding Networks: A Neural Solution to the Problem of Learning Reference Frames and Part-Whole Hierarchies
    (0 min) submitted by /u/hardmaru [link] [comments]
    [R] A convolutional neural network that reads your motor intentions through EEG.
    (1 min) This is an early work that leans on the current state of the art! I'm looking for new ideas to bring it online with less than 1/2 second latency! Without success, I have already tried numerous approaches, including recurrent neural networks. Do you have any ideas? What do you think? Paper: https://iopscience.iop.org/article/10.1088/1741-2552/ac4430 Code: https://github.com/Kubasinska/MI-EEG-1D-CNN submitted by /u/nientepanico [link] [comments]
    [P] Chrome Browser Extension to get Research Paper Details
    (2 min) Link: https://chrome.google.com/webstore/detail/paperslabmlai/agkeoekbbkpokfbpmmminnombikadbhb Identifies research papers mentioned in websites you visit, and the extension shows you the following details about research papers, Two-line summary. Availability of source code, videos, Reddit and Hacker News discussions. Popularity on Twitter. Conferences. Also, we released the source code of the extension, Github: https://github.com/labmlai/chrome-extension We love to hear your feedback and suggestions. Thank you all, and I appreciate the support. submitted by /u/hnipun [link] [comments]
  • Neural Networks, Deep Learning and Machine Learning

    Disco Diffusion v4.1 - my sample.
    (1 min) Well, since this is such a hype, then I decided to play with this neuron. I made a request "Night Moscow, painted by Van Gogh". I didn't finish anything, but it's already beautiful. Of course, this neuron introduces a lot of surrealism into the drawings, but who cares :-) https://preview.redd.it/mhyfv64oq9d81.png?width=1280&format=png&auto=webp&s=3b0199599933c231bffef12e569f01676cdb0db4 https://preview.redd.it/y0xdr38rq9d81.png?width=1280&format=png&auto=webp&s=84ce15bb4f4985561604ff7bb5015a89f65194c1 Under the last picture, the neuron even tried to put the artist's signature. submitted by /u/MaxBlacksmith [link] [comments]
  • Reinforcement Learning

    Unreal vs unity?
    (1 min) I started learning reinforcement learning and i really appreciate it. I am thinking about learning a game engine so i can apply what i have learned what do you ‏‏‎ recomment unity or unreal engine and if you recommend one can u send some tutorials to begin with Thanks in advance submitted by /u/United_Dark5818 [link] [comments]
    "Active Predictive Coding Networks: A Neural Solution to the Problem of Learning Reference Frames and Part-Whole Hierarchies", Gklezakos & Rao 2022
    (1 min) submitted by /u/gwern [link] [comments]
    why do we need value networks?
    (2 min) why do we need a baseline estimate(the value network)? can't i just use the actual reward to calculate the gradients? well then what about long term rewards you ask? meaning it should not just act at for short term rewards but rather plan for long term rewards. can i make the argument that taking actions for short term rewards automatically leads to better long term rewards? (if the problem is greedy) but now that i think about it, i can't really think about problems that aren't greedy. well, maybe games like hill climb racing (where you have limited fuel) aren't greedy, but then in games like that when the agent is optimized for short term rewards, it collects fuel when it encounters one, which leads to better long term rewards. so why the discounted sum of rewards? this especially makes sense in pixel-based games cause given a state i don't know how far the "finish line" or "end" is so the value network will not be able to do much to improve the process. can someone help me understand? submitted by /u/jedi1026 [link] [comments]
    a quick question
    (1 min) how long does it usually take to train a rl model which is using ppo? i am training an agent to play the supertuxkart game and i've been training it for nealy 6 hours now and it doesn't seem to learn. note: i'm using 3 parallel envs to collect training data and the obs space is (400, 200, 3) image. i'm using a 1660 for training. if it doesn't take that long where do you guys think i might have gone wrong? i know it's hard to tell without looking at the code, but any wild guesses? edit: https://github.com/jedi2610/tuxkart-ai edit 2: it's just a normal cnn arch. no fancy vae, curl, or anything at all. just resnets stacked up submitted by /u/jedi1026 [link] [comments]
    Reinforcement learning in Real Time Bidding
    (1 min) Hey guys, have any of you worked on Real Time Bidding (RTB) related problems using RL? If so, how good has the RL agent been in optimising the KPI (CPC, CPM...etc). Looking for some advice in solving a problem in CPC optimization for my university thesis. Any help is appreciated. Thanks submitted by /u/rafiki6633 [link] [comments]
    Reward Function Literature
    (1 min) Do you know of any articles, surveys or literature reviews about the reward function? I have been reading different articles that mention individually some methodologies, techniques, challenges that they used or had, for example (reward shaping, credit assignment for MARL, vector reward for MultiObjective, reward sparsity) but I have not found an article that talks about all together. submitted by /u/pulze9 [link] [comments]
    Best approach for a Twin-Stick Shooter
    (1 min) I rewrote robotron-2084 as an openai gym environment. (Please use it and give feedback!) Example Game Image I'm trying to figure out the best way to train it and was hoping for advice. As the title suggests, we have 2 9-state inputs. (8 cardinal directions and noop.) So, we have Discrete(81) states. We can throw away the noops and just always be moving/shooting to knock it down to 64. I've been attempting to train it using Stable Baselines3, but after a couple days of running, there is no progress. (policy_type: MlpPolicy, Learning Rate: 0.000001,) I was thinking of trying to split it up into two models: Movement that rewards on time alive, and shooting that rewards on score. I don't see an easy way to do this using any existing tools. Anyone have recommendations before I go about writing a learning function on my own. Thanks for any help you can give! submitted by /u/stridera [link] [comments]
    Value range clipping technique in on-policy algorithm
    (1 min) Hi, I did ablation study about discount factor on my customized environment. When I increased discount factor, then entire value function estimation was higher than before(0.6 -> 3). The result was expected, but there was also a question of whether the value expectation should be clipped to the same range for stability(stable value gradient descent). What do you think? Very thanks! submitted by /u/Spiritual_Fig3632 [link] [comments]
    Solving a Rubik's Cube from Scratch;
    (2 min) ​ https://i.redd.it/lfjz74cn6wc61.gif For my final year university project I trained an AI to solve a Rubik's Cube purely using reinforcement learning. This project follows the algorithm written by this paper. http://deepcube.igb.uci.edu/static/files/SolvingTheRubiksCubeWithDeepReinforcementLearningAndSearch_Final.pdf. This algorithm works by first training a neural network to output a guess on the number of moves away from the solved position given an initial scrambled position. This was done using simple value iteration. The training dataset was created on the fly by randomly scrambling cubes with depths of 1 to 40. Once training is completed this neural network can be used to solve cubes by using it as a heuristic in an A* search. The classic A* search algorithm was changed to include a depth weighting which trades optimality with speed. Training took around 7 days using one Tesla P100 GPU. Parallel Training definitely should have been used however this would have taken a bunch of work to implement so this was left out. This also meant hyperparameter tuning and network architecture experimenting was pretty limited. Compared to the results in the paper, my AI is slower and less optimal, solving on average taking 60 seconds with solution lengths around 40. However I was extremely happy with the results as I had neither the computational power or experience of the researchers and comparatively with most of the other projects on Github, being able to solve a 3x3 cube at all is an achievement. This algorithm can be transferred to many other puzzles. I‍ have successfully trained the 2x2 Cube, 15-Puzzle and 24-Puzzle as well. My github page for the code is here https://github.com/PhadonP/Rubiks-Cube-Reinforcement-Learning. There are many more details shown in the pdf report found in the repo. submitted by /u/beautifulduck970f [link] [comments]

2022-01-21

  • cs.LG updates on arXiv.org

    Towards an Understanding of Default Policies in Multitask Policy Optimization. (arXiv:2111.02994v3 [cs.LG] UPDATED)
    (2 min) Much of the recent success of deep reinforcement learning has been driven by regularized policy optimization (RPO) algorithms with strong performance across multiple domains. In this family of methods, agents are trained to maximize cumulative reward while penalizing deviation in behavior from some reference, or default policy. In addition to empirical success, there is a strong theoretical foundation for understanding RPO methods applied to single tasks, with connections to natural gradient, trust region, and variational approaches. However, there is limited formal understanding of desirable properties for default policies in the multitask setting, an increasingly important domain as the field shifts towards training more generally capable agents. Here, we take a first step towards filling this gap by formally linking the quality of the default policy to its effect on optimization. Using these results, we then derive a principled RPO algorithm for multitask learning with strong performance guarantees.
    A Regularized Limited Memory BFGS method for Large-Scale Unconstrained Optimization and its Efficient Implementations. (arXiv:2101.04413v1 [math.OC] CROSS LISTED)
    (2 min) The limited memory BFGS (L-BFGS) method is one of the popular methods for solving large-scale unconstrained optimization. Since the standard L-BFGS method uses a line search to guarantee its global convergence, it sometimes requires a large number of function evaluations. To overcome the difficulty, we propose a new L-BFGS with a certain regularization technique. We show its global convergence under the usual assumptions. In order to make the method more robust and efficient, we also extend it with several techniques such as nonmonotone technique and simultaneous use of the Wolfe line search. Finally, we present some numerical results for test problems in CUTEst, which show that the proposed method is robust in terms of solving number of problems.
    Beta Shapley: a Unified and Noise-reduced Data Valuation Framework for Machine Learning. (arXiv:2110.14049v2 [cs.LG] UPDATED)
    (2 min) Data Shapley has recently been proposed as a principled framework to quantify the contribution of individual datum in machine learning. It can effectively identify helpful or harmful data points for a learning algorithm. In this paper, we propose Beta Shapley, which is a substantial generalization of Data Shapley. Beta Shapley arises naturally by relaxing the efficiency axiom of the Shapley value, which is not critical for machine learning settings. Beta Shapley unifies several popular data valuation methods and includes data Shapley as a special case. Moreover, we prove that Beta Shapley has several desirable statistical properties and propose efficient algorithms to estimate it. We demonstrate that Beta Shapley outperforms state-of-the-art data valuation methods on several downstream ML tasks such as: 1) detecting mislabeled training data; 2) learning with subsamples; and 3) identifying points whose addition or removal have the largest positive or negative impact on the model.
    Building separable approximations for quantum states via neural networks. (arXiv:2112.08055v2 [quant-ph] UPDATED)
    (2 min) Finding the closest separable state to a given target state is a notoriously difficult task, even more difficult than deciding whether a state is entangled or separable. To tackle this task, we parametrize separable states with a neural network and train it to minimize the distance to a given target state, with respect to a differentiable distance, such as the trace distance or Hilbert-Schmidt distance. By examining the output of the algorithm, we can deduce whether the target state is entangled or not, and construct an approximation for its closest separable state. We benchmark the method on a variety of well-known classes of bipartite states and find excellent agreement, even up to local dimension of $d=10$. Moreover, we show our method to be efficient in the multipartite case, considering different notions of separability. Examining three and four-party GHZ and W states we recover known bounds and obtain novel ones, for instance for triseparability. Finally, we show how to use the neural network's results to gain analytic insight.
    Code Sophistication: From Code Recommendation to Logic Recommendation. (arXiv:2201.07674v1 [cs.SE])
    (2 min) A typical approach to programming is to first code the main execution scenario, and then focus on filling out alternative behaviors and corner cases. But, almost always, there exist unusual conditions that trigger atypical behaviors, which are hard to predict in program specifications, and are thus often not coded. In this paper, we consider the problem of detecting and recommending such missing behaviors, a task that we call code sophistication. Previous research on coding assistants usually focuses on recommending code fragments based on specifications of the intended behavior. In contrast, code sophistication happens in the absence of a specification, aiming to help developers complete the logic of their programs with missing and unspecified behaviors. We outline the research challenges to this problem and present early results showing how program logic can be completed by leveraging code structure and information about the usage of input parameters.
    Top-Down Influence? Predicting CEO Personality and Risk Impact from Speech Transcripts. (arXiv:2201.07670v1 [cs.CL])
    (2 min) How much does a CEO's personality impact the performance of their company? Management theory posits a great influence, but it is difficult to show empirically -- there is a lack of publicly available self-reported personality data of top managers. Instead, we propose a text-based personality regressor using crowd-sourced Myers--Briggs Type Indicator (MBTI) assessments. The ratings have a high internal and external validity and can be predicted with moderate to strong correlations for three out of four dimensions. Providing evidence for the upper echelons theory, we demonstrate that the predicted CEO personalities have explanatory power of financial risk.
    FedTune: Automatic Tuning of Federated Learning Hyper-Parameters from System Perspective. (arXiv:2110.03061v3 [cs.LG] UPDATED)
    (2 min) Federated learning (FL) hyper-parameters significantly affect the training overheads in terms of computation time, transmission time, computation load, and transmission load. However, the current practice of manually selecting FL hyper-parameters puts a high burden on FL practitioners since various applications prefer different training preferences. In this paper, we propose FedTune, an automatic FL hyper-parameter tuning algorithm tailored to applications' diverse system requirements of FL training. FedTune is lightweight and flexible, achieving 4.18%-22.48% improvement for different datasets compared to fixed FL hyper-parameters. FedTune is available at \url{https://github.com/dtczhl/FedTuning}.
    Scotch: An Efficient Secure Computation Framework for Secure Aggregation. (arXiv:2201.07730v1 [cs.CR])
    (2 min) Federated learning enables multiple data owners to jointly train a machine learning model without revealing their private datasets. However, a malicious aggregation server might use the model parameters to derive sensitive information about the training dataset used. To address such leakage, differential privacy and cryptographic techniques have been investigated in prior work, but these often result in large communication overheads or impact model performance. To mitigate this centralization of power, we propose \textsc{Scotch}, a decentralized \textit{m-party} secure-computation framework for federated aggregation that deploys MPC primitives, such as \textit{secret sharing}. Our protocol is simple, efficient, and provides strict privacy guarantees against curious aggregators or colluding data-owners with minimal communication overheads compared to other existing \textit{state-of-the-art} privacy-preserving federated learning frameworks. We evaluate our framework by performing extensive experiments on multiple datasets with promising results. \textsc{Scotch} can train the standard MLP NN with the training dataset split amongst 3 participating users and 3 aggregating servers with 96.57\% accuracy on MNIST, and 98.40\% accuracy on the Extended MNIST (digits) dataset, while providing various optimizations.
    Learning Tensor Representations for Meta-Learning. (arXiv:2201.07348v1 [cs.LG])
    (2 min) We introduce a tensor-based model of shared representation for meta-learning from a diverse set of tasks. Prior works on learning linear representations for meta-learning assume that there is a common shared representation across different tasks, and do not consider the additional task-specific observable side information. In this work, we model the meta-parameter through an order-$3$ tensor, which can adapt to the observed task features of the task. We propose two methods to estimate the underlying tensor. The first method solves a tensor regression problem and works under natural assumptions on the data generating process. The second method uses the method of moments under additional distributional assumptions and has an improved sample complexity in terms of the number of tasks. We also focus on the meta-test phase, and consider estimating task-specific parameters on a new task. Substituting the estimated tensor from the first step allows us estimating the task-specific parameters with very few samples of the new task, thereby showing the benefits of learning tensor representations for meta-learning. Finally, through simulation and several real-world datasets, we evaluate our methods and show that it improves over previous linear models of shared representations for meta-learning.
    Linear Polytree Structural Equation Models: Structural Learning and Inverse Correlation Estimation. (arXiv:2107.10955v2 [stat.ML] UPDATED)
    (2 min) We are interested in the problem of learning the directed acyclic graph (DAG) when data are generated from a linear structural equation model (SEM) and the causal structure can be characterized by a polytree. Under the Gaussian polytree models, we study sufficient conditions on the sample sizes for the well-known Chow-Liu algorithm to exactly recover both the skeleton and the equivalence class of the polytree, which is uniquely represented by a CPDAG. On the other hand, necessary conditions on the required sample sizes for both skeleton and CPDAG recovery are also derived in terms of information-theoretic lower bounds, which match the respective sufficient conditions and thereby give a sharp characterization of the difficulty of these tasks. We also consider extensions to the sub-Gaussian case, and then study the estimation of the inverse correlation matrix under such models. Our theoretical findings are illustrated by comprehensive numerical simulations, and experiments on benchmark data also demonstrate the robustness of polytree learning when the true graphical structures can only be approximated by polytrees.
    Generalized XGBoost Method. (arXiv:2109.07473v2 [cs.LG] UPDATED)
    (0 min) The XGBoost method has many advantages and is especially suitable for statistical analysis of big data, but its loss function is limited to convex functions. In many specific applications, a nonconvex loss function would be preferable. In this paper, I propose a generalized XGBoost method, which requires weaker loss function constraint and involves more general loss functions, including convex loss functions and some non-convex loss functions. Furthermore, this generalized XGBoost method is extended to multivariate loss function to form a more generalized XGBoost method. This method is a multiobjective parameter regularized tree boosting method, which can model multiple parameters in most of the frequently-used parametric probability distributions to be fitted by predictor variables. Meanwhile, the related algorithms and some examples in non-life insurance pricing are given.
    Spectro-Temporal Deep Features for Disordered Speech Assessment and Recognition. (arXiv:2201.05554v1 [cs.SD] CROSS LISTED)
    (0 min) Automatic recognition of disordered speech remains a highly challenging task to date. Sources of variability commonly found in normal speech including accent, age or gender, when further compounded with the underlying causes of speech impairment and varying severity levels, create large diversity among speakers. To this end, speaker adaptation techniques play a vital role in current speech recognition systems. Motivated by the spectro-temporal level differences between disordered and normal speech that systematically manifest in articulatory imprecision, decreased volume and clarity, slower speaking rates and increased dysfluencies, novel spectro-temporal subspace basis embedding deep features derived by SVD decomposition of speech spectrum are proposed to facilitate both accurate speech intelligibility assessment and auxiliary feature based speaker adaptation of state-of-the-art hybrid DNN and end-to-end disordered speech recognition systems. Experiments conducted on the UASpeech corpus suggest the proposed spectro-temporal deep feature adapted systems consistently outperformed baseline i-Vector adaptation by up to 2.63% absolute (8.6% relative) reduction in word error rate (WER) with or without data augmentation. Learning hidden unit contribution (LHUC) based speaker adaptation was further applied. The final speaker adapted system using the proposed spectral basis embedding features gave an overall WER of 25.6% on the UASpeech test set of 16 dysarthric speakers
    Transformers in Vision: A Survey. (arXiv:2101.01169v5 [cs.CV] UPDATED)
    (0 min) Astounding results from Transformer models on natural language tasks have intrigued the vision community to study their application to computer vision problems. Among their salient benefits, Transformers enable modeling long dependencies between input sequence elements and support parallel processing of sequence as compared to recurrent networks e.g., Long short-term memory (LSTM). Different from convolutional networks, Transformers require minimal inductive biases for their design and are naturally suited as set-functions. Furthermore, the straightforward design of Transformers allows processing multiple modalities (e.g., images, videos, text and speech) using similar processing blocks and demonstrates excellent scalability to very large capacity networks and huge datasets. These strengths have led to exciting progress on a number of vision tasks using Transformer networks. This survey aims to provide a comprehensive overview of the Transformer models in the computer vision discipline. We start with an introduction to fundamental concepts behind the success of Transformers i.e., self-attention, large-scale pre-training, and bidirectional encoding. We then cover extensive applications of transformers in vision including popular recognition tasks (e.g., image classification, object detection, action recognition, and segmentation), generative modeling, multi-modal tasks (e.g., visual-question answering, visual reasoning, and visual grounding), video processing (e.g., activity recognition, video forecasting), low-level vision (e.g., image super-resolution, image enhancement, and colorization) and 3D analysis (e.g., point cloud classification and segmentation). We compare the respective advantages and limitations of popular techniques both in terms of architectural design and their experimental value. Finally, we provide an analysis on open research directions and possible future works.
    Type4Py: Practical Deep Similarity Learning-Based Type Inference for Python. (arXiv:2101.04470v3 [cs.LG] UPDATED)
    (0 min) Dynamic languages, such as Python and Javascript, trade static typing for developer flexibility and productivity. Lack of static typing can cause run-time exceptions and is a major factor for weak IDE support. To alleviate these issues, PEP 484 introduced optional type annotations for Python. As retrofitting types to existing codebases is error-prone and laborious, machine learning (ML)-based approaches have been proposed to enable automatic type inference based on existing, partially annotated codebases. However, previous ML-based approaches are trained and evaluated on human-provided type annotations, which might not always be sound, and hence this may limit the practicality for real-world usage. In this paper, we present Type4Py, a deep similarity learning-based hierarchical neural network model. It learns to discriminate between similar and dissimilar types in a high-dimensional space, which results in clusters of types. Likely types for arguments, variables, and return values can then be inferred through the nearest neighbor search. Unlike previous work, we trained and evaluated our model on a type-checked dataset and used mean reciprocal rank (MRR) to reflect the performance perceived by users. The obtained results show that Type4Py achieves an MRR of 77.1%, which is a substantial improvement of 8.1% and 16.7% over the state-of-the-art approaches Typilus and TypeWriter, respectively. Finally, to aid developers with retrofitting types, we released a Visual Studio Code extension, which uses Type4Py to provide ML-based type auto-completion for Python.
    Superpixel Pre-Segmentation of HER2 Slides for Efficient Annotation. (arXiv:2201.07572v1 [cs.CV])
    (0 min) Supervised deep learning has shown state-of-the-art performance for medical image segmentation across different applications, including histopathology and cancer research; however, the manual annotation of such data is extremely laborious. In this work, we explore the use of superpixel approaches to compute a pre-segmentation of HER2 stained images for breast cancer diagnosis that facilitates faster manual annotation and correction in a second step. Four methods are compared: Standard Simple Linear Iterative Clustering (SLIC) as a baseline, a domain adapted SLIC, and superpixels based on feature embeddings of a pretrained ResNet-50 and a denoising autoencoder. To tackle oversegmentation, we propose to hierarchically merge superpixels, based on their content in the respective feature space. When evaluating the approaches on fully manually annotated images, we observe that the autoencoder-based superpixels achieve a 23% increase in boundary F1 score compared to the baseline SLIC superpixels. Furthermore, the boundary F1 score increases by 73% when hierarchical clustering is applied on the adapted SLIC and the autoencoder-based superpixels. These evaluations show encouraging first results for a pre-segmentation for efficient manual refinement without the need for an initial set of annotated training data.
    A Survey on Training Challenges in Generative Adversarial Networks for Biomedical Image Analysis. (arXiv:2201.07646v1 [cs.LG])
    (0 min) In biomedical image analysis, the applicability of deep learning methods is directly impacted by the quantity of image data available. This is due to deep learning models requiring large image datasets to provide high-level performance. Generative Adversarial Networks (GANs) have been widely utilized to address data limitations through the generation of synthetic biomedical images. GANs consist of two models. The generator, a model that learns how to produce synthetic images based on the feedback it receives. The discriminator, a model that classifies an image as synthetic or real and provides feedback to the generator. Throughout the training process, a GAN can experience several technical challenges that impede the generation of suitable synthetic imagery. First, the mode collapse problem whereby the generator either produces an identical image or produces a uniform image from distinct input features. Second, the non-convergence problem whereby the gradient descent optimizer fails to reach a Nash equilibrium. Thirdly, the vanishing gradient problem whereby unstable training behavior occurs due to the discriminator achieving optimal classification performance resulting in no meaningful feedback being provided to the generator. These problems result in the production of synthetic imagery that is blurry, unrealistic, and less diverse. To date, there has been no survey article outlining the impact of these technical challenges in the context of the biomedical imagery domain. This work presents a review and taxonomy based on solutions to the training problems of GANs in the biomedical imaging domain. This survey highlights important challenges and outlines future research directions about the training of GANs in the domain of biomedical imagery.
    Learning Inconsistent Preferences with Gaussian Processes. (arXiv:2006.03847v2 [stat.ML] UPDATED)
    (0 min) We revisit widely used preferential Gaussian processes by Chu et al.(2005) and challenge their modelling assumption that imposes rankability of data items via latent utility function values. We propose a generalisation of pgp which can capture more expressive latent preferential structures in the data and thus be used to model inconsistent preferences, i.e. where transitivity is violated, or to discover clusters of comparable items via spectral decomposition of the learned preference functions. We also consider the properties of associated covariance kernel functions and its reproducing kernel Hilbert Space (RKHS), giving a simple construction that satisfies universality in the space of preference functions. Finally, we provide an extensive set of numerical experiments on simulated and real-world datasets showcasing the competitiveness of our proposed method with state-of-the-art. Our experimental findings support the conjecture that violations of rankability are ubiquitous in real-world preferential data.
    Deep Capsule Encoder-Decoder Network for Surrogate Modeling and Uncertainty Quantification. (arXiv:2201.07753v1 [stat.ML])
    (0 min) We propose a novel \textit{capsule} based deep encoder-decoder model for surrogate modeling and uncertainty quantification of systems in mechanics from sparse data. The proposed framework is developed by adapting Capsule Network (CapsNet) architecture into image-to-image regression encoder-decoder network. Specifically, the aim is to exploit the benefits of CapsNet over convolution neural network (CNN) $-$ retaining pose and position information related to an entity to name a few. The performance of proposed approach is illustrated by solving an elliptic stochastic partial differential equation (SPDE), which also governs systems in mechanics such as steady heat conduction, ground water flow or other diffusion processes, based uncertainty quantification problem with an input dimensionality of $1024$. However, the problem definition does not the restrict the random diffusion field to a particular covariance structure, and the more strenuous task of response prediction for an arbitrary diffusion field is solved. The obtained results from performance evaluation indicate that the proposed approach is accurate, efficient, and robust.
    Sandbox Sample Classification Using Behavioral Indicators of Compromise. (arXiv:2201.07359v1 [cs.CR])
    (0 min) Behavioral Indicators of Compromise are associated with various automated methods used to extract the sample behavior by observing the system function calls performed in a virtual execution environment. Thus, every sample is described by a set of BICs triggered by the sample behavior in the sandbox environment. Here we discuss a Machine Learning approach to the classification of the sandbox samples as MALICIOUS or BENIGN, based on the list of triggered BICs. Besides the more traditional methods like Logistic Regression and Naive Bayes Classification we also discuss a different approach inspired by the statistical Monte Carlo methods. The numerical results are illustrated using ThreatGRID and ReversingLabs data.
    A Deep Learning Approach for Semantic Segmentation of Unbalanced Data in Electron Tomography of Catalytic Materials. (arXiv:2201.07342v1 [cond-mat.mtrl-sci])
    (0 min) Heterogeneous catalysts possess complex surface and bulk structures, relatively poor intrinsic contrast, and often a sparse distribution of the catalytic nanoparticles (NPs), posing a significant challenge for image segmentation, including the current state-of-the-art deep learning methods. To tackle this problem, we apply a deep learning-based approach for the multi-class semantic segmentation of a $\gamma$-Alumina/Pt catalytic material in a class imbalance situation. Specifically, we used the weighted focal loss as a loss function and attached it to the U-Net's fully convolutional network architecture. We assessed the accuracy of our results using Dice similarity coefficient (DSC), recall, precision, and Hausdorff distance (HD) metrics on the overlap between the ground-truth and predicted segmentations. Our adopted U-Net model with the weighted focal loss function achieved an average DSC score of 0.96 $\pm$ 0.003 in the $\gamma$-Alumina support material and 0.84 $\pm$ 0.03 in the Pt NPs segmentation tasks. We report an average boundary-overlap error of less than 2 nm at the 90th percentile of HD for $\gamma$-Alumina and Pt NPs segmentations. The complex surface morphology of the $\gamma$-Alumina and its relation to the Pt NPs were visualized in 3D by the deep learning-assisted automatic segmentation of a large data set of high-angle annular dark-field (HAADF) scanning transmission electron microscopy (STEM) tomography reconstructions.
    Conservative Distributional Reinforcement Learning with Safety Constraints. (arXiv:2201.07286v1 [cs.LG])
    (0 min) Safety exploration can be regarded as a constrained Markov decision problem where the expected long-term cost is constrained. Previous off-policy algorithms convert the constrained optimization problem into the corresponding unconstrained dual problem by introducing the Lagrangian relaxation technique. However, the cost function of the above algorithms provides inaccurate estimations and causes the instability of the Lagrange multiplier learning. In this paper, we present a novel off-policy reinforcement learning algorithm called Conservative Distributional Maximum a Posteriori Policy Optimization (CDMPO). At first, to accurately judge whether the current situation satisfies the constraints, CDMPO adapts distributional reinforcement learning method to estimate the Q-function and C-function. Then, CDMPO uses a conservative value function loss to reduce the number of violations of constraints during the exploration process. In addition, we utilize Weighted Average Proportional Integral Derivative (WAPID) to update the Lagrange multiplier stably. Empirical results show that the proposed method has fewer violations of constraints in the early exploration process. The final test results also illustrate that our method has better risk control.
    GPS: A Policy-driven Sampling Approach for Graph Representation Learning. (arXiv:2112.14482v2 [cs.LG] UPDATED)
    (0 min) Graph representation learning has drawn increasing attention in recent years, especially for learning the low dimensional embedding at both node and graph level for classification and recommendations tasks. To enable learning the representation on the large-scale graph data in the real world, numerous research has focused on developing different sampling strategies to facilitate the training process. Herein, we propose an adaptive Graph Policy-driven Sampling model (GPS), where the influence of each node in the local neighborhood is realized through the adaptive correlation calculation. Specifically, the selections of the neighbors are guided by an adaptive policy algorithm, contributing directly to the message aggregation, node embedding updating, and graph level readout steps. We then conduct comprehensive experiments against baseline methods on graph classification tasks from various perspectives. Our proposed model outperforms the existing ones by 3%-8% on several vital benchmarks, achieving state-of-the-art performance in real-world datasets.
    Is Contrastive Learning Suitable for Left Ventricular Segmentation in Echocardiographic Images?. (arXiv:2201.07219v1 [eess.IV])
    (0 min) Contrastive learning has proven useful in many applications where access to labelled data is limited. The lack of annotated data is particularly problematic in medical image segmentation as it is difficult to have clinical experts manually annotate large volumes of data. One such task is the segmentation of cardiac structures in ultrasound images of the heart. In this paper, we argue whether or not contrastive pretraining is helpful for the segmentation of the left ventricle in echocardiography images. Furthermore, we study the effect of this on two segmentation networks, DeepLabV3, as well as the commonly used segmentation network, UNet. Our results show that contrastive pretraining helps improve the performance on left ventricle segmentation, particularly when annotated data is scarce. We show how to achieve comparable results to state-of-the-art fully supervised algorithms when we train our models in a self-supervised fashion followed by fine-tuning on just 5% of the data. We also show that our solution achieves better results than what is currently published on a large public dataset (EchoNet-Dynamic) and we compare the performance of our solution on another smaller dataset (CAMUS) as well.
    PECOS: Prediction for Enormous and Correlated Output Spaces. (arXiv:2010.05878v2 [cs.LG] UPDATED)
    (0 min) Many large-scale applications amount to finding relevant results from an enormous output space of potential candidates. For example, finding the best matching product from a large catalog or suggesting related search phrases on a search engine. The size of the output space for these problems can range from millions to billions, and can even be infinite in some applications. Moreover, training data is often limited for the long-tail items in the output space. Fortunately, items in the output space are often correlated thereby presenting an opportunity to alleviate the data sparsity issue. In this paper, we propose the Prediction for Enormous and Correlated Output Spaces (PECOS) framework, a versatile and modular machine learning framework for solving prediction problems for very large output spaces, and apply it to the eXtreme Multilabel Ranking (XMR) problem: given an input instance, find and rank the most relevant items from an enormous but fixed and finite output space. We propose a three phase framework for PECOS: (i) in the first phase, PECOS organizes the output space using a semantic indexing scheme, (ii) in the second phase, PECOS uses the indexing to narrow down the output space by orders of magnitude using a machine learned matching scheme, and (iii) in the third phase, PECOS ranks the matched items using a final ranking scheme. The versatility and modularity of PECOS allows for easy plug-and-play of various choices for the indexing, matching, and ranking phases. We also develop very fast inference procedures which allow us to perform XMR predictions in real time; for example, inference takes less than 1 millisecond per input on the dataset with 2.8 million labels. The PECOS software is available at https://libpecos.org.
    Deperturbation of Online Social Networks via Bayesian Label Transition. (arXiv:2010.14121v3 [cs.LG] UPDATED)
    (0 min) Online social networks (OSNs) classify users into different categories based on their online activities and interests, a task which is referred as a node classification task. Such a task can be solved effectively using Graph Convolutional Networks (GCNs). However, a small number of users, so-called perturbators, may perform random activities on an OSN, which significantly deteriorate the performance of a GCN-based node classification task. Existing works in this direction defend GCNs either by adversarial training or by identifying the attacker nodes followed by their removal. However, both of these approaches require that the attack patterns or attacker nodes be identified first, which is difficult in the scenario when the number of perturbator nodes is very small. In this work, we develop a GCN defense model, namely GraphLT, which uses the concept of label transition. GraphLT assumes that perturbators' random activities deteriorate GCN's performance. To overcome this issue, GraphLT subsequently uses a novel Bayesian label transition model, which takes GCN's predicted labels and applies label transitions by Gibbs-sampling-based inference and thus repairs GCN's prediction to achieve better node classification. Extensive experiments on seven benchmark datasets show that GraphLT considerably enhances the performance of the node classifier in an unperturbed environment; furthermore, it validates that GraphLT can successfully repair a GCN-based node classifier with superior performance than several competing methods.
    Towards Green Automated Machine Learning: Status Quo and Future Directions. (arXiv:2111.05850v2 [cs.LG] UPDATED)
    (0 min) Automated machine learning (AutoML) strives for the automatic configuration of machine learning algorithms and their composition into an overall (software) solution - a machine learning pipeline - tailored to the learning task (dataset) at hand. Over the last decade, AutoML has developed into an independent research field with hundreds of contributions. While AutoML offers many prospects, it is also known to be quite resource-intensive, which is one of its major points of criticism. The primary cause for a high resource consumption is that many approaches rely on the (costly) evaluation of many machine learning pipelines while searching for good candidates. This problem is amplified in the context of research on AutoML methods, due to large scale experiments conducted with many datasets and approaches, each of them being run with several repetitions to rule out random effects. In the spirit of recent work on Green AI, this paper is written in an attempt to raise the awareness of AutoML researchers for the problem and to elaborate on possible remedies. To this end, we identify four categories of actions the community may take towards more sustainable research on AutoML, i.e. Green AutoML: design of AutoML systems, benchmarking, transparency and research incentives.
    Prospective Learning: Back to the Future. (arXiv:2201.07372v1 [cs.LG])
    (0 min) Research on both natural intelligence (NI) and artificial intelligence (AI) generally assumes that the future resembles the past: intelligent agents or systems (what we call 'intelligence') observe and act on the world, then use this experience to act on future experiences of the same kind. We call this 'retrospective learning'. For example, an intelligence may see a set of pictures of objects, along with their names, and learn to name them. A retrospective learning intelligence would merely be able to name more pictures of the same objects. We argue that this is not what true intelligence is about. In many real world problems, both NIs and AIs will have to learn for an uncertain future. Both must update their internal models to be useful for future tasks, such as naming fundamentally new objects and using these objects effectively in a new context or to achieve previously unencountered goals. This ability to learn for the future we call 'prospective learning'. We articulate four relevant factors that jointly define prospective learning. Continual learning enables intelligences to remember those aspects of the past which it believes will be most useful in the future. Prospective constraints (including biases and priors) facilitate the intelligence finding general solutions that will be applicable to future problems. Curiosity motivates taking actions that inform future decision making, including in previously unmet situations. Causal estimation enables learning the structure of relations that guide choosing actions for specific outcomes, even when the specific action-outcome contingencies have never been observed before. We argue that a paradigm shift from retrospective to prospective learning will enable the communities that study intelligence to unite and overcome existing bottlenecks to more effectively explain, augment, and engineer intelligences.
    Scalable Hypergraph Embedding System. (arXiv:2103.09660v3 [cs.SI] UPDATED)
    (0 min) Many problems such as vertex classification andlink prediction in network data can be solvedusing graph embeddings, and a number of algo-rithms are known for constructing such embed-dings. However, it is difficult to use graphs tocapture non-binary relations such as communitiesof vertices. These kinds of complex relations areexpressed more naturally as hypergraphs. Whilehypergraphs are a generalization of graphs, state-of-the-art graph embedding techniques are notadequate for solving prediction and classificationtasks on large hypergraphs accurately in reason-able time. In this paper, we introduce NetVec,a novel multi-level framework for scalable un-supervised hypergraph embedding, that can becoupled with any graph embedding algorithm toproduce embeddings of hypergraphs with millionsof nodes and hyperedges in a few minutes.
    Towards a General Deep Feature Extractor for Facial Expression Recognition. (arXiv:2201.07781v1 [cs.CV])
    (0 min) The human face conveys a significant amount of information. Through facial expressions, the face is able to communicate numerous sentiments without the need for verbalisation. Visual emotion recognition has been extensively studied. Recently several end-to-end trained deep neural networks have been proposed for this task. However, such models often lack generalisation ability across datasets. In this paper, we propose the Deep Facial Expression Vector ExtractoR (DeepFEVER), a new deep learning-based approach that learns a visual feature extractor general enough to be applied to any other facial emotion recognition task or dataset. DeepFEVER outperforms state-of-the-art results on the AffectNet and Google Facial Expression Comparison datasets. DeepFEVER's extracted features also generalise extremely well to other datasets -- even those unseen during training -- namely, the Real-World Affective Faces (RAF) dataset.
    RAFT: A Real-World Few-Shot Text Classification Benchmark. (arXiv:2109.14076v3 [cs.CL] UPDATED)
    (0 min) Large pre-trained language models have shown promise for few-shot learning, completing text-based tasks given only a few task-specific examples. Will models soon solve classification tasks that have so far been reserved for human research assistants? Existing benchmarks are not designed to measure progress in applied settings, and so don't directly answer this question. The RAFT benchmark (Real-world Annotated Few-shot Tasks) focuses on naturally occurring tasks and uses an evaluation setup that mirrors deployment. Baseline evaluations on RAFT reveal areas current techniques struggle with: reasoning over long texts and tasks with many classes. Human baselines show that some classification tasks are difficult for non-expert humans, reflecting that real-world value sometimes depends on domain expertise. Yet even non-expert human baseline F1 scores exceed GPT-3 by an average of 0.11. The RAFT datasets and leaderboard will track which model improvements translate into real-world benefits at https://raft.elicit.org .
    A Bayesian multiscale CNN framework to predict local stress fields in structures with microscale features. (arXiv:2012.11330v4 [cs.CE] UPDATED)
    (0 min) Multiscale computational modelling is challenging due to the high computational cost of direct numerical simulation by finite elements. To address this issue, concurrent multiscale methods use the solution of cheaper macroscale surrogates as boundary conditions to microscale sliding windows. The microscale problems remain a numerically challenging operation both in terms of implementation and cost. In this work we propose to replace the local microscale solution by an Encoder-Decoder Convolutional Neural Network that will generate fine-scale stress corrections to coarse predictions around unresolved microscale features, without prior parametrisation of local microscale problems. We deploy a Bayesian approach providing credible intervals to evaluate the uncertainty of the predictions, which is then used to investigate the merits of a selective learning framework. We will demonstrate the capability of the approach to predict equivalent stress fields in porous structures using linearised and finite strain elasticity theories.
    Lifted Primal-Dual Method for Bilinearly Coupled Smooth Minimax Optimization. (arXiv:2201.07427v1 [math.OC])
    (0 min) We study the bilinearly coupled minimax problem: $\min_{x} \max_{y} f(x) + y^\top A x - h(y)$, where $f$ and $h$ are both strongly convex smooth functions and admit first-order gradient oracles. Surprisingly, no known first-order algorithms have hitherto achieved the lower complexity bound of $\Omega((\sqrt{\frac{L_x}{\mu_x}} + \frac{\|A\|}{\sqrt{\mu_x \mu_y}} + \sqrt{\frac{L_y}{\mu_y}}) \log(\frac1{\varepsilon}))$ for solving this problem up to an $\varepsilon$ primal-dual gap in the general parameter regime, where $L_x, L_y,\mu_x,\mu_y$ are the corresponding smoothness and strongly convexity constants. We close this gap by devising the first optimal algorithm, the Lifted Primal-Dual (LPD) method. Our method lifts the objective into an extended form that allows both the smooth terms and the bilinear term to be handled optimally and seamlessly with the same primal-dual framework. Besides optimality, our method yields a desirably simple single-loop algorithm that uses only one gradient oracle call per iteration. Moreover, when $f$ is just convex, the same algorithm applied to a smoothed objective achieves the nearly optimal iteration complexity. We also provide a direct single-loop algorithm, using the LPD method, that achieves the iteration complexity of $O(\sqrt{\frac{L_x}{\varepsilon}} + \frac{\|A\|}{\sqrt{\mu_y \varepsilon}} + \sqrt{\frac{L_y}{\varepsilon}})$. Numerical experiments on quadratic minimax problems and policy evaluation problems further demonstrate the fast convergence of our algorithm in practice.
    Comprehensive Efficiency Analysis of Machine Learning Algorithms for Developing Hardware-Based Cybersecurity Countermeasures. (arXiv:2201.07654v1 [cs.CR])
    (0 min) Modern computing systems have led cyber adversaries to create more sophisticated malware than was previously available in the early days of technology. Dated detection techniques such as Anti-Virus Software (AVS) based on signature-based methods could no longer keep up with the demand that computer systems required of them. The complexity of modern malware has led to the development of contemporary detection techniques that use the machine learning field and hardware to boost the detection rates of malicious software. These new techniques use Hardware Performance Counters (HPCs) that form a digital signature of sorts. After the models are fed training data, they can reference these HPCs to classify zero-day malware samples. A problem emerges when malware with no comparable HPC values comes into contact with these new techniques. We provide an analysis of several machine learning and deep learning models that run zero-day samples and evaluate the results from the conversion of C++ algorithms to a hardware description language (HDL) used to begin a hardware implementation. Our results present a lack of accuracy from the models when running zero-day malware data as our highest detector, decision tree, was only able to reach 91.2% accuracy and had an F1-Score of 91.5% in the form of a decision tree. Next, through the Receiver Operating Curve (ROC) and area-under-the-curve (AUC), we can also determine that the algorithms did not present significant robustness as the largest AUC was only 0.819. In addition, we viewed relatively high overhead for our ensemble learning algorithm while also only having an 86.3% accuracy and 86% F1-Score. Finally, as an additional task, we adapted the one rule algorithm to fit many rules to make malware classification understandable to everyday users by allowing them to view the regulations while maintaining relatively high accuracy.
    Optimality Inductive Biases and Agnostic Guidelines for Offline Reinforcement Learning. (arXiv:2107.01407v2 [cs.LG] UPDATED)
    (0 min) The performance of state-of-the-art offline RL methods varies widely over the spectrum of dataset qualities, ranging from far-from-optimal random data to close-to-optimal expert demonstrations. We re-implement these methods to test their reproducibility, and show that when a given method outperforms the others on one end of the spectrum, it never does on the other end. This prevents us from naming a victor across the board. We attribute the asymmetry to the amount of inductive bias injected into the agent to entice it to posit that the behavior underlying the offline dataset is optimal for the task. Our investigations confirm that careless injections of such optimality inductive biases make dominant agents subpar as soon as the offline policy is sub-optimal. To bridge this gap, we generalize importance-weighted regression methods that have proved the most versatile across the spectrum of dataset grades into a modular framework that allows for the design of methods that align with how much we know about the dataset. This modularity enables qualitatively different injections of optimality inductive biases. We show that certain orchestrations strike the right balance, improving the return on one end of the spectrum without harming it on the other end. While the formulation of guidelines for the design of an offline method reduces to aligning the amount of optimality bias to inject with what we know about the quality of the data, the design of an agnostic method for which we need not know the quality of the data beforehand is more nuanced. Only our framework allowed us to design a method that performed well across the spectrum while remaining modular if more information about the quality of the data ever becomes available.
    Dual Space Graph Contrastive Learning. (arXiv:2201.07409v1 [cs.LG])
    (0 min) Unsupervised graph representation learning has emerged as a powerful tool to address real-world problems and achieves huge success in the graph learning domain. Graph contrastive learning is one of the unsupervised graph representation learning methods, which recently attracts attention from researchers and has achieved state-of-the-art performances on various tasks. The key to the success of graph contrastive learning is to construct proper contrasting pairs to acquire the underlying structural semantics of the graph. However, this key part is not fully explored currently, most of the ways generating contrasting pairs focus on augmenting or perturbating graph structures to obtain different views of the input graph. But such strategies could degrade the performances via adding noise into the graph, which may narrow down the field of the applications of graph contrastive learning. In this paper, we propose a novel graph contrastive learning method, namely \textbf{D}ual \textbf{S}pace \textbf{G}raph \textbf{C}ontrastive (DSGC) Learning, to conduct graph contrastive learning among views generated in different spaces including the hyperbolic space and the Euclidean space. Since both spaces have their own advantages to represent graph data in the embedding spaces, we hope to utilize graph contrastive learning to bridge the spaces and leverage advantages from both sides. The comparison experiment results show that DSGC achieves competitive or better performances among all the datasets. In addition, we conduct extensive experiments to analyze the impact of different graph encoders on DSGC, giving insights about how to better leverage the advantages of contrastive learning between different spaces.
    The Enforcers: Consistent Sparse-Discrete Methods for Constraining Informative Emergent Communication. (arXiv:2201.07452v1 [cs.LG])
    (0 min) Communication enables agents to cooperate to achieve their goals. Learning when to communicate, i.e. sparse communication, is particularly important where bandwidth is limited, in situations where agents interact with humans, in partially observable scenarios where agents must convey information unavailable to others, and in non-cooperative scenarios where agents may hide information to gain a competitive advantage. Recent work in learning sparse communication, however, suffers from high variance training where, the price of decreasing communication is a decrease in reward, particularly in cooperative tasks. Sparse communications are necessary to match agent communication to limited human bandwidth. Humans additionally communicate via discrete linguistic tokens, previously shown to decrease task performance when compared to continuous communication vectors. This research addresses the above issues by limiting the loss in reward of decreasing communication and eliminating the penalty for discretization. In this work, we successfully constrain training using a learned gate to regulate when to communicate while using discrete prototypes that reflect what to communicate for cooperative tasks with partial observability. We provide two types of "Enforcers" for hard and soft budget constraints and present results of communication under different budgets. We show that our method satisfies constraints while yielding the same performance as comparable, unconstrained methods.
    AI-based Carcinoma Detection and Classification Using Histopathological Images: A Systematic Review. (arXiv:2201.07231v1 [eess.IV])
    (0 min) Histopathological image analysis is the gold standard to diagnose cancer. Carcinoma is a subtype of cancer that constitutes more than 80% of all cancer cases. Squamous cell carcinoma and adenocarcinoma are two major subtypes of carcinoma, diagnosed by microscopic study of biopsy slides. However, manual microscopic evaluation is a subjective and time-consuming process. Many researchers have reported methods to automate carcinoma detection and classification. The increasing use of artificial intelligence (AI) in the automation of carcinoma diagnosis also reveals a significant rise in the use of deep network models. In this systematic literature review, we present a comprehensive review of the state-of-the-art approaches reported in carcinoma diagnosis using histopathological images. Studies are selected from well-known databases with strict inclusion/exclusion criteria. We have categorized the articles and recapitulated their methods based on specific organs of carcinoma origin. Further, we have summarized pertinent literature on AI methods, highlighted critical challenges and limitations, and provided insights on future research direction in automated carcinoma diagnosis. Out of 101 articles selected, most of the studies experimented on private datasets with varied image sizes, obtaining accuracy between 63% and 100%. Overall, this review highlights the need for a generalized AI-based carcinoma diagnostic system. Additionally, it is desirable to have accountable approaches to extract microscopic features from images of multiple magnifications that should mimic pathologists' evaluations.
    Towards holistic scene understanding: Semantic segmentation and beyond. (arXiv:2201.07734v1 [cs.CV])
    (0 min) This dissertation addresses visual scene understanding and enhances segmentation performance and generalization, training efficiency of networks, and holistic understanding. First, we investigate semantic segmentation in the context of street scenes and train semantic segmentation networks on combinations of various datasets. In Chapter 2 we design a framework of hierarchical classifiers over a single convolutional backbone, and train it end-to-end on a combination of pixel-labeled datasets, improving generalizability and the number of recognizable semantic concepts. Chapter 3 focuses on enriching semantic segmentation with weak supervision and proposes a weakly-supervised algorithm for training with bounding box-level and image-level supervision instead of only with per-pixel supervision. The memory and computational load challenges that arise from simultaneous training on multiple datasets are addressed in Chapter 4. We propose two methodologies for selecting informative and diverse samples from datasets with weak supervision to reduce our networks' ecological footprint without sacrificing performance. Motivated by memory and computation efficiency requirements, in Chapter 5, we rethink simultaneous training on heterogeneous datasets and propose a universal semantic segmentation framework. This framework achieves consistent increases in performance metrics and semantic knowledgeability by exploiting various scene understanding datasets. Chapter 6 introduces the novel task of part-aware panoptic segmentation, which extends our reasoning towards holistic scene understanding. This task combines scene and parts-level semantics with instance-level object detection. In conclusion, our contributions span over convolutional network architectures, weakly-supervised learning, part and panoptic segmentation, paving the way towards a holistic, rich, and sustainable visual scene understanding.
    Physics Informed Deep Kernel Learning. (arXiv:2006.04976v2 [stat.ML] UPDATED)
    (0 min) Deep kernel learning is a promising combination of deep neural networks and nonparametric function learning. However, as a data driven approach, the performance of deep kernel learning can still be restricted by scarce or insufficient data, especially in extrapolation tasks. To address these limitations, we propose Physics Informed Deep Kernel Learning (PI-DKL) that exploits physics knowledge represented by differential equations with latent sources. Specifically, we use the posterior function sample of the Gaussian process as the surrogate for the solution of the differential equation, and construct a generative component to integrate the equation in a principled Bayesian hybrid framework. For efficient and effective inference, we marginalize out the latent variables in the joint probability and derive a collapsed model evidence lower bound (ELBO), based on which we develop a stochastic model estimation algorithm. Our ELBO can be viewed as a nice, interpretable posterior regularization objective. On synthetic datasets and real-world applications, we show the advantage of our approach in both prediction accuracy and uncertainty quantification.
    Cortical lesions, central vein sign, and paramagnetic rim lesions in multiple sclerosis: emerging machine learning techniques and future avenues. (arXiv:2201.07463v1 [eess.IV])
    (0 min) The current multiple sclerosis (MS) diagnostic criteria lack specificity, and this may lead to misdiagnosis, which remains an issue in present-day clinical practice. In addition, conventional biomarkers only moderately correlate with MS disease progression. Recently, advanced MS lesional imaging biomarkers such as cortical lesions (CL), the central vein sign (CVS), and paramagnetic rim lesions (PRL), visible in specialized magnetic resonance imaging (MRI) sequences, have shown higher specificity in differential diagnosis. Moreover, studies have shown that CL and PRL are potential prognostic biomarkers, the former correlating with cognitive impairments and the latter with early disability progression. As machine learning-based methods have achieved extraordinary performance in the assessment of conventional imaging biomarkers, such as white matter lesion segmentation, several automated or semi-automated methods have been proposed for CL, CVS, and PRL as well. In the present review, we first introduce these advanced MS imaging biomarkers and their imaging methods. Subsequently, we describe the corresponding machine learning-based methods that were used to tackle these clinical questions, putting them into context with respect to the challenges they are still facing, including non-standardized MRI protocols, limited datasets, and moderate inter-rater variability. We conclude by presenting the current limitations that prevent their broader deployment and suggesting future research directions.
    Look Closer: Bridging Egocentric and Third-Person Views with Transformers for Robotic Manipulation. (arXiv:2201.07779v1 [cs.RO])
    (0 min) Learning to solve precision-based manipulation tasks from visual feedback using Reinforcement Learning (RL) could drastically reduce the engineering efforts required by traditional robot systems. However, performing fine-grained motor control from visual inputs alone is challenging, especially with a static third-person camera as often used in previous work. We propose a setting for robotic manipulation in which the agent receives visual feedback from both a third-person camera and an egocentric camera mounted on the robot's wrist. While the third-person camera is static, the egocentric camera enables the robot to actively control its vision to aid in precise manipulation. To fuse visual information from both cameras effectively, we additionally propose to use Transformers with a cross-view attention mechanism that models spatial attention from one view to another (and vice-versa), and use the learned features as input to an RL policy. Our method improves learning over strong single-view and multi-view baselines, and successfully transfers to a set of challenging manipulation tasks on a real robot with uncalibrated cameras, no access to state information, and a high degree of task variability. In a hammer manipulation task, our method succeeds in 75% of trials versus 38% and 13% for multi-view and single-view baselines, respectively.
    Learning Interpretable Models Using an Oracle. (arXiv:1906.06852v3 [cs.LG] UPDATED)
    (0 min) As Machine Learning (ML) becomes pervasive in various real world systems, the need for models to be understandable has increased. We focus on interpretability, noting that models often need to be constrained in size for them to be considered interpretable, e.g., a decision tree of depth 5 is easier to interpret than one of depth 50. But smaller models also tend to have high bias. This suggests a trade-off between interpretability and accuracy. We propose a model agnostic technique to minimize this trade-off. Our strategy is to first learn a powerful, possibly black-box, probabilistic model -- referred to as the oracle -- on the training data. Uncertainty in the oracle's predictions are used to learn a sampling distribution for the training data. The interpretable model is trained on a sample obtained using this distribution. We demonstrate that such a model often is significantly more accurate than one trained on the original data. Determining the sampling strategy is formulated as an optimization problem. Our solution to this problem possesses the following key favorable properties: (1) the number of optimization variables is independent of the dimensionality of the data: a fixed number of seven variables are used (2) our technique is model agnostic - in that both the interpretable model and the oracle may belong to arbitrary model families. Results using multiple real world datasets, using Linear Probability Models and Decision Trees as interpretable models, with Gradient Boosted Model and Random Forest as oracles, are presented. We observe significant relative improvements in the F1-score in most cases, occasionally seeing improvements greater than 100%. Additionally, we discuss an interesting application of our technique where a Gated Recurrent Unit network is used to improve the sequence classification accuracy of a Decision Tree that uses character n-grams as features.
    DSNet: Dynamic Skin Deformation Prediction by Recurrent Neural Network. (arXiv:2201.07660v1 [cs.GR])
    (0 min) Skin dynamics contributes to the enriched realism of human body models in rendered scenes. Traditional methods rely on physics-based simulations to accurately reproduce the dynamic behavior of soft tissues. Due to the model complexity and thus the heavy computation, however, they do not directly offer practical solutions to domains where real-time performance is desirable. The quality shapes obtained by physics-based simulations are not fully exploited by example-based or more recent datadriven methods neither, with most of them having focused on the modeling of static skin shapes by leveraging quality data. To address these limitations, we present a learningbased method for dynamic skin deformation. At the core of our work is a recurrent neural network that learns to predict the nonlinear, dynamics-dependent shape change over time from pre-existing mesh deformation sequence data. Our network also learns to predict the variation of skin dynamics across different individuals with varying body shapes. After training the network delivers realistic, high-quality skin dynamics that is specific to a person in a real-time course. We obtain results that significantly saves the computational time, while maintaining comparable prediction quality compared to state-of-the-art results.
    On the Convergence Rates of Policy Gradient Methods. (arXiv:2201.07443v1 [math.OC])
    (0 min) We consider infinite-horizon discounted Markov decision problems with finite state and action spaces. We show that with direct parametrization in the policy space, the weighted value function, although non-convex in general, is both quasi-convex and quasi-concave. While quasi-convexity helps explain the convergence of policy gradient methods to global optima, quasi-concavity hints at their convergence guarantees using arbitrarily large step sizes that are not dictated by the Lipschitz constant charactering smoothness of the value function. In particular, we show that when using geometrically increasing step sizes, a general class of policy mirror descent methods, including the natural policy gradient method and a projected Q-descent method, all enjoy a linear rate of convergence without relying on entropy or other strongly convex regularization. In addition, we develop a theory of weak gradient-mapping dominance and use it to prove sharper sublinear convergence rate of the projected policy gradient method. Finally, we also analyze the convergence rate of an inexact policy mirror descent method and estimate its sample complexity under a simple generative model.
    Do not rug on me: Zero-dimensional Scam Detection. (arXiv:2201.07220v1 [cs.CR])
    (0 min) Uniswap, like other DEXs, has gained much attention this year because it is a non-custodial and publicly verifiable exchange that allows users to trade digital assets without trusted third parties. However, its simplicity and lack of regulation also makes it easy to execute initial coin offering scams by listing non-valuable tokens. This method of performing scams is known as rug pull, a phenomenon that already existed in traditional finance but has become more relevant in DeFi. Various projects such as [34,37] have contributed to detecting rug pulls in EVM compatible chains. However, the first longitudinal and academic step to detecting and characterizing scam tokens on Uniswap was made in [44]. The authors collected all the transactions related to the Uniswap V2 exchange and proposed a machine learning algorithm to label tokens as scams. However, the algorithm is only valuable for detecting scams accurately after they have been executed. This paper increases their data set by 20K tokens and proposes a new methodology to label tokens as scams. After manually analyzing the data, we devised a theoretical classification of different malicious maneuvers in Uniswap protocol. We propose various machine-learning-based algorithms with new relevant features related to the token propagation and smart contract heuristics to detect potential rug pulls before they occur. In general, the models proposed achieved similar results. The best model obtained an accuracy of 0.9936, recall of 0.9540, and precision of 0.9838 in distinguishing non-malicious tokens from scams prior to the malicious maneuver.
    Overview frequency principle/spectral bias in deep learning. (arXiv:2201.07395v1 [cs.LG])
    (0 min) Understanding deep learning is increasingly emergent as it penetrates more and more into industry and science. In recent years, a research line from Fourier analysis sheds lights into this magical "black box" by showing a Frequency Principle (F-Principle or spectral bias) of the training behavior of deep neural networks (DNNs) -- DNNs often fit functions from low to high frequency during the training. The F-Principle is first demonstrated by one-dimensional synthetic data followed by the verification in high-dimensional real datasets. A series of works subsequently enhance the validity of the F-Principle. This low-frequency implicit bias reveals the strength of neural network in learning low-frequency functions as well as its deficiency in learning high-frequency functions. Such understanding inspires the design of DNN-based algorithms in practical problems, explains experimental phenomena emerging in various scenarios, and further advances the study of deep learning from the frequency perspective. Although incomplete, we provide an overview of F-Principle and propose some open problems for future research.
    Compressed Smooth Sparse Decomposition. (arXiv:2201.07404v1 [eess.IV])
    (0 min) Image-based anomaly detection systems are of vital importance in various manufacturing applications. The resolution and acquisition rate of such systems is increasing significantly in recent years under the fast development of image sensing technology. This enables the detection of tiny defects in real-time. However, such a high resolution and acquisition rate of image data not only slows down the speed of image processing algorithms but also increases data storage and transmission cost. To tackle this problem, we propose a fast and data-efficient method with theoretical performance guarantee that is suitable for sparse anomaly detection in images with a smooth background (smooth plus sparse signal). The proposed method, named Compressed Smooth Sparse Decomposition (CSSD), is a one-step method that unifies the compressive image acquisition and decomposition-based image processing techniques. To further enhance its performance in a high-dimensional scenario, a Kronecker Compressed Smooth Sparse Decomposition (KronCSSD) method is proposed. Compared to traditional smooth and sparse decomposition algorithms, significant transmission cost reduction and computational speed boost can be achieved with negligible performance loss. Simulation examples and several case studies in various applications illustrate the effectiveness of the proposed framework.
    Enhanced Self-Organizing Map Solution for the Traveling Salesman Problem. (arXiv:2201.07208v1 [cs.NE])
    (0 min) Using an enhanced Self-Organizing Map method, we provided suboptimal solutions to the Traveling Salesman Problem. Besides, we employed hyperparameter tuning to identify the most critical features in the algorithm. All improvements in the benchmark work brought consistent results and may inspire future efforts to improve this algorithm and apply it to different problems.
    GNN-based Android Malware Detection with Jumping Knowledge. (arXiv:2201.07537v1 [cs.CR])
    (0 min) This paper presents a new Android malware detection method based on Graph Neural Networks (GNNs) with Jumping-Knowledge (JK). Android function call graphs (FCGs) consist of a set of program functions and their inter-procedural calls. Thus, this paper proposes a GNN-based method for Android malware detection by capturing meaningful intra-procedural call path patterns. In addition, a Jumping-Knowledge technique is applied to minimize the effect of the over-smoothing problem, which is common in GNNs. The proposed method has been extensively evaluated using two benchmark datasets. The results demonstrate the superiority of our approach compared to baseline methods in terms of key classification metrics, which demonstrates the potential of GNNs in Android malware detection.
    Using machine learning to parametrize postmerger signals from binary neutron stars. (arXiv:2201.06461v1 [gr-qc] CROSS LISTED)
    (0 min) There is growing interest in the detection and characterization of gravitational waves from postmerger oscillations of binary neutron stars. These signals contain information about the nature of the remnant and the high-density and out-of-equilibrium physics of the postmerger processes, which would complement any electromagnetic signal. However, the construction of binary neutron star postmerger waveforms is much more complicated than for binary black holes: (i) there are theoretical uncertainties in the neutron-star equation of state and other aspects of the high-density physics, (ii) numerical simulations are expensive and available ones only cover a small fraction of the parameter space with limited numerical accuracy, and (iii) it is unclear how to parametrize the theoretical uncertainties and interpolate across parameter space. In this work, we describe the use of a machine-learning method called a conditional variational autoencoder (CVAE) to construct postmerger models for hyper/massive neutron star remnant signals based on numerical-relativity simulations. The CVAE provides a probabilistic model, which encodes uncertainties in the training data within a set of latent parameters. We estimate that training such a model will ultimately require $\sim 10^4$ waveforms. However, using synthetic training waveforms as a proof-of-principle, we show that the CVAE can be used as an accurate generative model and that it encodes the equation of state in a useful latent representation.
    The Elliptical Potential Lemma for General Distributions with an Application to Linear Thompson Sampling. (arXiv:2102.07987v3 [stat.ML] UPDATED)
    (0 min) In this note, we introduce a general version of the well-known elliptical potential lemma that is a widely used technique in the analysis of algorithms in sequential learning and decision-making problems. We consider a stochastic linear bandit setting where a decision-maker sequentially chooses among a set of given actions, observes their noisy rewards, and aims to maximize her cumulative expected reward over a decision-making horizon. The elliptical potential lemma is a key tool for quantifying uncertainty in estimating parameters of the reward function, but it requires the noise and the prior distributions to be Gaussian. Our general elliptical potential lemma relaxes this Gaussian requirement which is a highly non-trivial extension for a number of reasons; unlike the Gaussian case, there is no closed-form solution for the covariance matrix of the posterior distribution, the covariance matrix is not a deterministic function of the actions, and the covariance matrix is not decreasing with respect to the semidefinite inequality. While this result is of broad interest, we showcase an application of it to prove an improved Bayesian regret bound for the well-known Thompson sampling algorithm in stochastic linear bandits with changing action sets where prior and noise distributions are general. This bound is minimax optimal up to constants.
    Coupled Support Tensor Machine Classification for Multimodal Neuroimaging Data. (arXiv:2201.07683v1 [stat.ML])
    (0 min) Multimodal data arise in various applications where information about the same phenomenon is acquired from multiple sensors and across different imaging modalities. Learning from multimodal data is of great interest in machine learning and statistics research as this offers the possibility of capturing complementary information among modalities. Multimodal modeling helps to explain the interdependence between heterogeneous data sources, discovers new insights that may not be available from a single modality, and improves decision-making. Recently, coupled matrix-tensor factorization has been introduced for multimodal data fusion to jointly estimate latent factors and identify complex interdependence among the latent factors. However, most of the prior work on coupled matrix-tensor factors focuses on unsupervised learning and there is little work on supervised learning using the jointly estimated latent factors. This paper considers the multimodal tensor data classification problem. A Coupled Support Tensor Machine (C-STM) built upon the latent factors jointly estimated from the Advanced Coupled Matrix Tensor Factorization (ACMTF) is proposed. C-STM combines individual and shared latent factors with multiple kernels and estimates a maximal-margin classifier for coupled matrix tensor data. The classification risk of C-STM is shown to converge to the optimal Bayes risk, making it a statistically consistent rule. C-STM is validated through simulation studies as well as a simultaneous EEG-fMRI analysis. The empirical evidence shows that C-STM can utilize information from multiple sources and provide a better classification performance than traditional single-mode classifiers.
    OrbNet: Deep Learning for Quantum Chemistry Using Symmetry-Adapted Atomic-Orbital Features. (arXiv:2007.08026v3 [physics.chem-ph] UPDATED)
    (0 min) We introduce a machine learning method in which energy solutions from the Schrodinger equation are predicted using symmetry adapted atomic orbitals features and a graph neural-network architecture. \textsc{OrbNet} is shown to outperform existing methods in terms of learning efficiency and transferability for the prediction of density functional theory results while employing low-cost features that are obtained from semi-empirical electronic structure calculations. For applications to datasets of drug-like molecules, including QM7b-T, QM9, GDB-13-T, DrugBank, and the conformer benchmark dataset of Folmsbee and Hutchison, \textsc{OrbNet} predicts energies within chemical accuracy of DFT at a computational cost that is thousand-fold or more reduced.
    Human-Level Control through Directly-Trained Deep Spiking Q-Networks. (arXiv:2201.07211v1 [cs.NE])
    (0 min) As the third-generation neural networks, Spiking Neural Networks (SNNs) have great potential on neuromorphic hardware because of their high energy-efficiency. However, Deep Spiking Reinforcement Learning (DSRL), i.e., the Reinforcement Learning (RL) based on SNNs, is still in its preliminary stage due to the binary output and the non-differentiable property of the spiking function. To address these issues, we propose a Deep Spiking Q-Network (DSQN) in this paper. Specifically, we propose a directly-trained deep spiking reinforcement learning architecture based on the Leaky Integrate-and-Fire (LIF) neurons and Deep Q-Network (DQN). Then, we adapt a direct spiking learning algorithm for the Deep Spiking Q-Network. We further demonstrate the advantages of using LIF neurons in DSQN theoretically. Comprehensive experiments have been conducted on 17 top-performing Atari games to compare our method with the state-of-the-art conversion method. The experimental results demonstrate the superiority of our method in terms of performance, stability, robustness and energy-efficiency. To the best of our knowledge, our work is the first one to achieve state-of-the-art performance on multiple Atari games with the directly-trained SNN.
    Stein Variational Gaussian Processes. (arXiv:2009.12141v3 [stat.ML] UPDATED)
    (0 min) We show how to use Stein variational gradient descent (SVGD) to carry out inference in Gaussian process (GP) models with non-Gaussian likelihoods and large data volumes. Markov chain Monte Carlo (MCMC) is extremely computationally intensive for these situations, but the parametric assumptions required for efficient variational inference (VI) result in incorrect inference when they encounter the multi-modal posterior distributions that are common for such models. SVGD provides a non-parametric alternative to variational inference which is substantially faster than MCMC. We prove that for GP models with Lipschitz gradients the SVGD algorithm monotonically decreases the Kullback-Leibler divergence from the sampling distribution to the true posterior. Our method is demonstrated on benchmark problems in both regression and classification, a multimodal posterior, and an air quality example with 550,134 spatiotemporal observations, showing substantial performance improvements over MCMC and VI.
    Flexible Parallel Learning in Edge Scenarios: Communication, Computational and Energy Cost. (arXiv:2201.07402v1 [cs.NI])
    (0 min) Traditionally, distributed machine learning takes the guise of (i) different nodes training the same model (as in federated learning), or (ii) one model being split among multiple nodes (as in distributed stochastic gradient descent). In this work, we highlight how fog- and IoT-based scenarios often require combining both approaches, and we present a framework for flexible parallel learning (FPL), achieving both data and model parallelism. Further, we investigate how different ways of distributing and parallelizing learning tasks across the participating nodes result in different computation, communication, and energy costs. Our experiments, carried out using state-of-the-art deep-network architectures and large-scale datasets, confirm that FPL allows for an excellent trade-off among computational (hence energy) cost, communication overhead, and learning performance.
    Towards Federated Clustering: A Federated Fuzzy $c$-Means Algorithm (FFCM). (arXiv:2201.07316v1 [cs.LG])
    (0 min) Federated Learning (FL) is a setting where multiple parties with distributed data collaborate in training a joint Machine Learning (ML) model while keeping all data local at the parties. Federated clustering is an area of research within FL that is concerned with grouping together data that is globally similar while keeping all data local. We describe how this area of research can be of interest in itself, or how it helps addressing issues like non-independently-identically-distributed (i.i.d.) data in supervised FL frameworks. The focus of this work, however, is an extension of the federated fuzzy $c$-means algorithm to the FL setting (FFCM) as a contribution towards federated clustering. We propose two methods to calculate global cluster centers and evaluate their behaviour through challenging numerical experiments. We observe that one of the methods is able to identify good global clusters even in challenging scenarios, but also acknowledge that many challenges remain open.
    Near-Optimal Sparse Allreduce for Distributed Deep Learning. (arXiv:2201.07598v1 [cs.DC])
    (0 min) Communication overhead is one of the major obstacles to train large deep learning models at scale. Gradient sparsification is a promising technique to reduce the communication volume. However, it is very challenging to obtain real performance improvement because of (1) the difficulty of achieving an scalable and efficient sparse allreduce algorithm and (2) the sparsification overhead. This paper proposes O$k$-Top$k$, a scheme for distributed training with sparse gradients. O$k$-Top$k$ integrates a novel sparse allreduce algorithm (less than 6$k$ communication volume which is asymptotically optimal) with the decentralized parallel Stochastic Gradient Descent (SGD) optimizer, and its convergence is proved. To reduce the sparsification overhead, O$k$-Top$k$ efficiently selects the top-$k$ gradient values according to an estimated threshold. Evaluations are conducted on the Piz Daint supercomputer with neural network models from different deep learning domains. Empirical results show that O$k$-Top$k$ achieves similar model accuracy to dense allreduce. Compared with the optimized dense and the state-of-the-art sparse allreduces, O$k$-Top$k$ is more scalable and significantly improves training throughput (e.g., 3.29x-12.95x improvement for BERT on 256 GPUs).
    Decentralized Federated Learning: Balancing Communication and Computing Costs. (arXiv:2107.12048v3 [cs.LG] UPDATED)
    (0 min) Decentralized stochastic gradient descent (SGD) is a driving engine for decentralized federated learning (DFL). The performance of decentralized SGD is jointly influenced by inter-node communications and local updates. In this paper, we propose a general DFL framework, which implements both multiple local updates and multiple inter-node communications periodically, to strike a balance between communication efficiency and model consensus. It can provide a general decentralized SGD analytical framework. We establish strong convergence guarantees for the proposed DFL algorithm without the assumption of convex objectives. The convergence rate of DFL can be optimized to achieve the balance of communication and computing costs under constrained resources. For improving communication efficiency of DFL, compressed communication is further introduced to the proposed DFL as a new scheme, named DFL with compressed communication (C-DFL). The proposed C-DFL exhibits linear convergence for strongly convex objectives. Experiment results based on MNIST and CIFAR-10 datasets illustrate the superiority of DFL over traditional decentralized SGD methods and show that C-DFL further enhances communication efficiency.
    Simpler is better: spectral regularization and up-sampling techniques for variational autoencoders. (arXiv:2201.07544v1 [cs.LG])
    (0 min) Full characterization of the spectral behavior of generative models based on neural networks remains an open issue. Recent research has focused heavily on generative adversarial networks and the high-frequency discrepancies between real and generated images. The current solution to avoid this is to either replace transposed convolutions with bilinear up-sampling or add a spectral regularization term in the generator. It is well known that Variational Autoencoders (VAEs) also suffer from these issues. In this work, we propose a simple 2D Fourier transform-based spectral regularization loss for the VAE and show that it can achieve results equal to, or better than, the current state-of-the-art in frequency-aware losses for generative models. In addition, we experiment with altering the up-sampling procedure in the generator network and investigate how it influences the spectral performance of the model. We include experiments on synthetic and real data sets to demonstrate our results.
    Self-Supervised Metric Learning in Multi-View Data: A Downstream Task Perspective. (arXiv:2106.07138v3 [stat.ML] UPDATED)
    (0 min) Self-supervised metric learning has been a successful approach for learning a distance from an unlabeled dataset. The resulting distance is broadly useful for improving various distance-based downstream tasks, even when no information from downstream tasks is utilized in the metric learning stage. To gain insights into this approach, we develop a statistical framework to theoretically study how self-supervised metric learning can benefit downstream tasks in the context of multi-view data. Under this framework, we show that the target distance of metric learning satisfies several desired properties for the downstream tasks. On the other hand, our investigation suggests the target distance can be further improved by moderating each direction's weights. In addition, our analysis precisely characterizes the improvement by self-supervised metric learning on four commonly used downstream tasks: sample identification, two-sample testing, $k$-means clustering, and $k$-nearest neighbor classification. When the distance is estimated from an unlabeled dataset, we establish the upper bound on distance estimation's accuracy and the number of samples sufficient for downstream task improvement. Finally, numerical experiments are presented to support the theoretical results in the paper.
    Towards Personalized Federated Learning. (arXiv:2103.00710v2 [cs.LG] UPDATED)
    (0 min) In parallel with the rapid adoption of Artificial Intelligence (AI) empowered by advances in AI research, there have been growing awareness and concerns of data privacy. Recent significant developments in the data regulation landscape have prompted a seismic shift in interest towards privacy-preserving AI. This has contributed to the popularity of Federated Learning (FL), the leading paradigm for the training of machine learning models on data silos in a privacy-preserving manner. In this survey, we explore the domain of Personalized FL (PFL) to address the fundamental challenges of FL on heterogeneous data, a universal characteristic inherent in all real-world datasets. We analyze the key motivations for PFL and present a unique taxonomy of PFL techniques categorized according to the key challenges and personalization strategies in PFL. We highlight their key ideas, challenges and opportunities and envision promising future trajectories of research towards new PFL architectural design, realistic PFL benchmarking, and trustworthy PFL approaches.
    Model Transferring Attacks to Backdoor HyperNetwork in Personalized Federated Learning. (arXiv:2201.07063v2 [cs.LG] UPDATED)
    (0 min) This paper explores previously unknown backdoor risks in HyperNet-based personalized federated learning (HyperNetFL) through poisoning attacks. Based upon that, we propose a novel model transferring attack (called HNTROJ), i.e., the first of its kind, to transfer a local backdoor infected model to all legitimate and personalized local models, which are generated by the HyperNetFL model, through consistent and effective malicious local gradients computed across all compromised clients in the whole training process. As a result, HNTROJ reduces the number of compromised clients needed to successfully launch the attack without any observable signs of sudden shifts or degradation regarding model utility on legitimate data samples making our attack stealthy. To defend against HNTROJ, we adapted several backdoor-resistant FL training algorithms into HyperNetFL. An extensive experiment that is carried out using several benchmark datasets shows that HNTROJ significantly outperforms data poisoning and model replacement attacks and bypasses robust training algorithms.
    Multi-Agent Curricula and Emergent Implicit Signaling. (arXiv:2106.11156v2 [cs.MA] UPDATED)
    (0 min) Emergent communication has made strides towards learning communication from scratch, but has focused primarily on protocols that resemble human language. In nature, multi-agent cooperation gives rise to a wide range of communication that varies in structure and complexity. In this work, we recognize the full spectrum of communication that exists in nature and propose studying lower-level communication. Specifically, we study emergent implicit signaling in the context of decentralized multi-agent learning in difficult, sparse reward environments. However, learning to coordinate in such environments is challenging. We propose a curriculum-driven strategy that combines: (i) velocity-based environment shaping, tailored to the skill level of the multi-agent team; and (ii) a behavioral curriculum that helps agents learn successful single-agent behaviors as a precursor to learning multi-agent behaviors. Pursuit-evasion experiments show that our approach learns effective coordination, significantly outperforming sophisticated analytical and learned policies. Our method completes the pursuit-evasion task even when pursuers move at half of the evader's speed, whereas the highest-performing baseline fails at 80% of the evader's speed. Moreover, we examine the use of implicit signals in coordination through position-based social influence. We show that pursuers trained with our strategy exchange more than twice as much information (in bits) than baseline methods, indicating that our method has learned, and relies heavily on, the exchange of implicit signals.
    Lifelong Generative Learning via Knowledge Reconstruction. (arXiv:2201.06418v1 [cs.LG] CROSS LISTED)
    (0 min) Generative models often incur the catastrophic forgetting problem when they are used to sequentially learning multiple tasks, i.e., lifelong generative learning. Although there are some endeavors to tackle this problem, they suffer from high time-consumptions or error accumulation. In this work, we develop an efficient and effective lifelong generative model based on variational autoencoder (VAE). Unlike the generative adversarial network, VAE enjoys high efficiency in the training process, providing natural benefits with few resources. We deduce a lifelong generative model by expending the intrinsic reconstruction character of VAE to the historical knowledge retention. Further, we devise a feedback strategy about the reconstructed data to alleviate the error accumulation. Experiments on the lifelong generating tasks of MNIST, FashionMNIST, and SVHN verified the efficacy of our approach, where the results were comparable to SOTA.
    TranAD: Deep Transformer Networks for Anomaly Detection in Multivariate Time Series Data. (arXiv:2201.07284v1 [cs.LG])
    (0 min) Efficient anomaly detection and diagnosis in multivariate time-series data is of great importance for modern industrial applications. However, building a system that is able to quickly and accurately pinpoint anomalous observations is a challenging problem. This is due to the lack of anomaly labels, high data volatility and the demands of ultra-low inference times in modern applications. Despite the recent developments of deep learning approaches for anomaly detection, only a few of them can address all of these challenges. In this paper, we propose TranAD, a deep transformer network based anomaly detection and diagnosis model which uses attention-based sequence encoders to swiftly perform inference with the knowledge of the broader temporal trends in the data. TranAD uses focus score-based self-conditioning to enable robust multi-modal feature extraction and adversarial training to gain stability. Additionally, model-agnostic meta learning (MAML) allows us to train the model using limited data. Extensive empirical studies on six publicly available datasets demonstrate that TranAD can outperform state-of-the-art baseline methods in detection and diagnosis performance with data and time-efficient training. Specifically, TranAD increases F1 scores by up to 17%, reducing training times by up to 99% compared to the baselines.
    Including STDP to eligibility propagation in multi-layer recurrent spiking neural networks. (arXiv:2201.07602v1 [cs.NE])
    (0 min) Spiking neural networks (SNNs) in neuromorphic systems are more energy efficient compared to deep learning-based methods, but there is no clear competitive learning algorithm for training such SNNs. Eligibility propagation (e-prop) offers an efficient and biologically plausible way to train competitive recurrent SNNs in low-power neuromorphic hardware. In this report, previous performance of e-prop on a speech classification task is reproduced, and the effects of including STDP-like behavior are analyzed. Including STDP to the ALIF neuron model improves the classification performance, but this is not the case for the Izhikevich e-prop neuron. Finally, it was found that e-prop implemented in a single-layer recurrent SNN consistently outperforms a multi-layer variant.
    Can't Steal? Cont-Steal! Contrastive Stealing Attacks Against Image Encoders. (arXiv:2201.07513v1 [cs.CR])
    (0 min) Unsupervised representation learning techniques have been developing rapidly to make full use of unlabeled images. They encode images into rich features that are oblivious to downstream tasks. Behind its revolutionary representation power, the requirements for dedicated model designs and a massive amount of computation resources expose image encoders to the risks of potential model stealing attacks -- a cheap way to mimic the well-trained encoder performance while circumventing the demanding requirements. Yet conventional attacks only target supervised classifiers given their predicted labels and/or posteriors, which leaves the vulnerability of unsupervised encoders unexplored. In this paper, we first instantiate the conventional stealing attacks against encoders and demonstrate their severer vulnerability compared with downstream classifiers. To better leverage the rich representation of encoders, we further propose Cont-Steal, a contrastive-learning-based attack, and validate its improved stealing effectiveness in various experiment settings. As a takeaway, we appeal to our community's attention to the intellectual property protection of representation learning techniques, especially to the defenses against encoder stealing attacks like ours.
    Parameterized MDPs and Reinforcement Learning Problems -- A Maximum Entropy Principle Based Framework. (arXiv:2006.09646v3 [cs.LG] UPDATED)
    (0 min) We present a framework to address a class of sequential decision making problems. Our framework features learning the optimal control policy with robustness to noisy data, determining the unknown state and action parameters, and performing sensitivity analysis with respect to problem parameters. We consider two broad categories of sequential decision making problems modelled as infinite horizon Markov Decision Processes (MDPs) with (and without) an absorbing state. The central idea underlying our framework is to quantify exploration in terms of the Shannon Entropy of the trajectories under the MDP and determine the stochastic policy that maximizes it while guaranteeing a low value of the expected cost along a trajectory. This resulting policy enhances the quality of exploration early on in the learning process, and consequently allows faster convergence rates and robust solutions even in the presence of noisy data as demonstrated in our comparisons to popular algorithms such as Q-learning, Double Q-learning and entropy regularized Soft Q-learning. The framework extends to the class of parameterized MDP and RL problems, where states and actions are parameter dependent, and the objective is to determine the optimal parameters along with the corresponding optimal policy. Here, the associated cost function can possibly be non-convex with multiple poor local minima. Simulation results applied to a 5G small cell network problem demonstrate successful determination of communication routes and the small cell locations. We also obtain sensitivity measures to problem parameters and robustness to noisy environment data.
    Sparsification of Decomposable Submodular Functions. (arXiv:2201.07289v1 [cs.DS])
    (0 min) Submodular functions are at the core of many machine learning and data mining tasks. The underlying submodular functions for many of these tasks are decomposable, i.e., they are sum of several simple submodular functions. In many data intensive applications, however, the number of underlying submodular functions in the original function is so large that we need prohibitively large amount of time to process it and/or it does not even fit in the main memory. To overcome this issue, we introduce the notion of sparsification for decomposable submodular functions whose objective is to obtain an accurate approximation of the original function that is a (weighted) sum of only a few submodular functions. Our main result is a polynomial-time randomized sparsification algorithm such that the expected number of functions used in the output is independent of the number of underlying submodular functions in the original function. We also study the effectiveness of our algorithm under various constraints such as matroid and cardinality constraints. We complement our theoretical analysis with an empirical study of the performance of our algorithm.
    ReGNL: Rapid Prediction of GDP during Disruptive Events using Nightlights. (arXiv:2201.07612v1 [cs.LG])
    (0 min) Policy makers often make decisions based on parameters such as GDP, unemployment rate, industrial output, etc. The primary methods to obtain or even estimate such information are resource intensive and time consuming. In order to make timely and well-informed decisions, it is imperative to be able to come up with proxies for these parameters which can be sampled quickly and efficiently, especially during disruptive events, like the COVID-19 pandemic. Recently, there has been a lot of focus on using remote sensing data for this purpose. The data has become cheaper to collect compared to surveys, and can be available in real time. In this work, we present Regional GDP NightLight (ReGNL), a neural network based model which is trained on a custom dataset of historical nightlights and GDP data along with the geographical coordinates of a place, and estimates the GDP of the place, given the other parameters. Taking the case of 50 US states, we find that ReGNL is disruption-agnostic and is able to predict the GDP for both normal years (2019) and for years with a disruptive event (2020). ReGNL outperforms timeseries ARIMA methods for prediction, even during the pandemic. Following from our findings, we make a case for building infrastructures to collect and make available granular data, especially in resource-poor geographies, so that these can be leveraged for policy making during disruptive events.
    PADA: Example-based Prompt Learning for on-the-fly Adaptation to Unseen Domains. (arXiv:2102.12206v3 [cs.CL] UPDATED)
    (0 min) Natural Language Processing algorithms have made incredible progress, but they still struggle when applied to out-of-distribution examples. We address a challenging and underexplored version of this domain adaptation problem, where an algorithm is trained on several source domains, and then applied to examples from unseen domains that are unknown at training time. Particularly, no examples, labeled or unlabeled, or any other knowledge about the target domain are available to the algorithm at training time. We present PADA: An example-based autoregressive Prompt learning algorithm for on-the-fly Any-Domain Adaptation, based on the T5 language model. Given a test example, PADA first generates a unique prompt for it and then, conditioned on this prompt, labels the example with respect to the NLP prediction task. PADA is trained to generate a prompt which is a token sequence of unrestricted length, consisting of Domain Related Features (DRFs) that characterize each of the source domains. Intuitively, the generated prompt is a unique signature that maps the test example to a semantic space spanned by the source domains. In experiments with 3 tasks (text classification and sequence tagging), for a total of 14 multi-source adaptation scenarios, PADA substantially outperforms strong baselines.
    Model-Driven Deep Learning Based Channel Estimation and Feedback for Millimeter-Wave Massive Hybrid MIMO Systems. (arXiv:2104.11052v3 [cs.IT] CROSS LISTED)
    (2 min) This paper proposes a model-driven deep learning (MDDL)-based channel estimation and feedback scheme for wideband millimeter-wave (mmWave) massive hybrid multiple-input multiple-output (MIMO) systems, where the angle-delay domain channels' sparsity is exploited for reducing the overhead. Firstly, we consider the uplink channel estimation for time-division duplexing systems. To reduce the uplink pilot overhead for estimating the high-dimensional channels from a limited number of radio frequency (RF) chains at the base station (BS), we propose to jointly train the phase shift network and the channel estimator as an auto-encoder. Particularly, by exploiting the channels' structured sparsity from an a priori model and learning the integrated trainable parameters from the data samples, the proposed multiple-measurement-vectors learned approximate message passing (MMV-LAMP) network with the devised redundant dictionary can jointly recover multiple subcarriers' channels with significantly enhanced performance. Moreover, we consider the downlink channel estimation and feedback for frequency-division duplexing systems. Similarly, the pilots at the BS and channel estimator at the users can be jointly trained as an encoder and a decoder, respectively. Besides, to further reduce the channel feedback overhead, only the received pilots on part of the subcarriers are fed back to the BS, which can exploit the MMV-LAMP network to reconstruct the spatial-frequency channel matrix. Numerical results show that the proposed MDDL-based channel estimation and feedback scheme outperforms the state-of-the-art approaches.
    TourBERT: A pretrained language model for the tourism industry. (arXiv:2201.07449v1 [cs.CL])
    (0 min) The Bidirectional Encoder Representations from Transformers (BERT) is currently one of the most important and state-of-the-art models for natural language. However, it has also been shown that for domain-specific tasks it is helpful to pretrain BERT on a domain-specific corpus. In this paper, we present TourBERT, a pretrained language model for tourism. We describe how TourBERT was developed and evaluated. The evaluations show that TourBERT is outperforming BERT in all tourism-specific tasks.
    Weakly Supervised Contrastive Learning for Better Severity Scoring of Lung Ultrasound. (arXiv:2201.07357v1 [eess.IV])
    (2 min) With the onset of the COVID-19 pandemic, ultrasound has emerged as an effective tool for bedside monitoring of patients. Due to this, a large amount of lung ultrasound scans have been made available which can be used for AI based diagnosis and analysis. Several AI-based patient severity scoring models have been proposed that rely on scoring the appearance of the ultrasound scans. AI models are trained using ultrasound-appearance severity scores that are manually labeled based on standardized visual features. We address the challenge of labeling every ultrasound frame in the video clips. Our contrastive learning method treats the video clip severity labels as noisy weak severity labels for individual frames, thus requiring only video-level labels. We show that it performs better than the conventional cross-entropy loss based training. We combine frame severity predictions to come up with video severity predictions and show that the frame based model achieves comparable performance to a video based TSM model, on a large dataset combining public and private sources.
    Coordinating Momenta for Cross-silo Federated Learning. (arXiv:2102.03970v2 [cs.LG] UPDATED)
    (2 min) Communication efficiency is crucial for federated learning (FL). Conducting local training steps in clients to reduce the communication frequency between clients and the server is a common method to address this issue. However, this strategy leads to the client drift problem due to \textit{non-i.i.d.} data distributions in different clients which severely deteriorates the performance. In this work, we propose a new method to improve the training performance in cross-silo FL via maintaining double momentum buffers. In our algorithm, one momentum buffer is used to track the server model updating direction, and the other one is adopted to track the local model updating direction. More important, we introduce a novel momentum fusion technique to coordinate the server and local momentum buffers. We also derive the first theoretical convergence analysis involving both the server and local standard momentum SGD. Extensive deep FL experimental results verify that our new approach has a better training performance than the FedAvg and existing standard momentum SGD variants.
    On Dynamic Pricing with Covariates. (arXiv:2112.13254v2 [cs.LG] UPDATED)
    (2 min) We consider the dynamic pricing problem with covariates under a generalized linear demand model: a seller can dynamically adjust the price of a product over a horizon of $T$ time periods, and at each time period $t$, the demand of the product is jointly determined by the price and an observable covariate vector $x_t\in\mathbb{R}^d$ through an unknown generalized linear model. Most of the existing literature assumes the covariate vectors $x_t$'s are independently and identically distributed (i.i.d.); the few papers that relax this assumption either sacrifice model generality or yield sub-optimal regret bounds. In this paper we show that a simple pricing algorithm has an $O(d\sqrt{T}\log T)$ regret upper bound without assuming any statistical structure on the covariates $x_t$ (which can even be arbitrarily chosen). The upper bound on the regret matches the lower bound (even under the i.i.d. assumption) up to logarithmic factors. Our paper thus shows that (i) the i.i.d. assumption is not necessary for obtaining low regret, and (ii) the regret bound can be independent of the (inverse) minimum eigenvalue of the covariance matrix of the $x_t$'s, a quantity present in previous bounds. Furthermore, we discuss a condition under which a better regret is achievable and how a Thompson sampling algorithm can be applied to give an efficient computation of the prices.
    Investigation of Data Augmentation Techniques for Disordered Speech Recognition. (arXiv:2201.05562v1 [cs.SD] CROSS LISTED)
    (2 min) Disordered speech recognition is a highly challenging task. The underlying neuro-motor conditions of people with speech disorders, often compounded with co-occurring physical disabilities, lead to the difficulty in collecting large quantities of speech required for system development. This paper investigates a set of data augmentation techniques for disordered speech recognition, including vocal tract length perturbation (VTLP), tempo perturbation and speed perturbation. Both normal and disordered speech were exploited in the augmentation process. Variability among impaired speakers in both the original and augmented data was modeled using learning hidden unit contributions (LHUC) based speaker adaptive training. The final speaker adapted system constructed using the UASpeech corpus and the best augmentation approach based on speed perturbation produced up to 2.92% absolute (9.3% relative) word error rate (WER) reduction over the baseline system without data augmentation, and gave an overall WER of 26.37% on the test set containing 16 dysarthric speakers.
    Visual Exploration of Machine Learning Model Behavior with Hierarchical Surrogate Rule Sets. (arXiv:2201.07724v1 [cs.HC])
    (0 min) One of the potential solutions for model interpretation is to train a surrogate model: a more transparent model that approximates the behavior of the model to be explained. Typically, classification rules or decision trees are used due to the intelligibility of their logic-based expressions. However, decision trees can grow too deep and rule sets can become too large to approximate a complex model. Unlike paths on a decision tree that must share ancestor nodes (conditions), rules are more flexible. However, the unstructured visual representation of rules makes it hard to make inferences across rules. To address these issues, we present a workflow that includes novel algorithmic and interactive solutions. First, we present Hierarchical Surrogate Rules (HSR), an algorithm that generates hierarchical rules based on user-defined parameters. We also contribute SuRE, a visual analytics (VA) system that integrates HSR and interactive surrogate rule visualizations. Particularly, we present a novel feature-aligned tree to overcome the shortcomings of existing rule visualizations. We evaluate the algorithm in terms of parameter sensitivity, time performance, and comparison with surrogate decision trees and find that it scales reasonably well and outperforms decision trees in many respects. We also evaluate the visualization and the VA system by a usability study with 24 volunteers and an observational study with 7 domain experts. Our investigation shows that the participants can use feature-aligned trees to perform non-trivial tasks with very high accuracy. We also discuss many interesting observations that can be useful for future research on designing effective rule-based VA systems.
    Detection of Correlated Alarms Using Graph Embedding. (arXiv:2201.07748v1 [cs.LG])
    (0 min) Industrial alarm systems have recently progressed considerably in terms of network complexity and the number of alarms. The increase in complexity and number of alarms presents challenges in these systems that decrease system efficiency and cause distrust of the operator, which might result in widespread damages. One contributing factor in alarm inefficiency is the correlated alarms. These alarms do not contain new information and only confuse the operator. This paper tries to present a novel method for detecting correlated alarms based on artificial intelligence methods to help the operator. The proposed method is based on graph embedding and alarm clustering, resulting in the detection of correlated alarms. To evaluate the proposed method, a case study is conducted on the well-known Tennessee-Eastman process.
    Semi-Supervised Clustering with Contrastive Learning for Discovering New Intents. (arXiv:2201.07604v1 [cs.LG])
    (0 min) Most dialogue systems in real world rely on predefined intents and answers for QA service, so discovering potential intents from large corpus previously is really important for building such dialogue services. Considering that most scenarios have few intents known already and most intents waiting to be discovered, we focus on semi-supervised text clustering and try to make the proposed method benefit from labeled samples for better overall clustering performance. In this paper, we propose Deep Contrastive Semi-supervised Clustering (DCSC), which aims to cluster text samples in a semi-supervised way and provide grouped intents to operation staff. To make DCSC fully utilize the limited known intents, we propose a two-stage training procedure for DCSC, in which DCSC will be trained on both labeled samples and unlabeled samples, and achieve better text representation and clustering performance. We conduct experiments on two public datasets to compare our model with several popular methods, and the results show DCSC achieve best performance across all datasets and circumstances, indicating the effect of the improvements in our work.
    On the Acceleration of Deep Learning Model Parallelism with Staleness. (arXiv:1909.02625v3 [cs.LG] UPDATED)
    (0 min) Training the deep convolutional neural network for computer vision problems is slow and inefficient, especially when it is large and distributed across multiple devices. The inefficiency is caused by the backpropagation algorithm's forward locking, backward locking, and update locking problems. Existing solutions for acceleration either can only handle one locking problem or lead to severe accuracy loss or memory inefficiency. Moreover, none of them consider the straggler problem among devices. In this paper, we propose Layer-wise Staleness and a novel efficient training algorithm, Diversely Stale Parameters (DSP), to address these challenges. We also analyze the convergence of DSP with two popular gradient-based methods and prove that both of them are guaranteed to converge to critical points for non-convex problems. Finally, extensive experimental results on training deep learning models demonstrate that our proposed DSP algorithm can achieve significant training speedup with stronger robustness than compared methods.
    Tiny, always-on and fragile: Bias propagation through design choices in on-device machine learning workflows. (arXiv:2201.07677v1 [cs.LG])
    (0 min) Billions of distributed, heterogeneous and resource constrained smart consumer devices deploy on-device machine learning (ML) to deliver private, fast and offline inference on personal data. On-device ML systems are highly context dependent, and sensitive to user, usage, hardware and environmental attributes. Despite this sensitivity and the propensity towards bias in ML, bias in on-device ML has not been studied. This paper studies the propagation of bias through design choices in on-device ML development workflows. We position \emph{reliablity bias}, which arises from disparate device failures across demographic groups, as a source of unfairness in on-device ML settings and quantify metrics to evaluate it. We then identify complex and interacting technical design choices in the on-device ML workflow that can lead to disparate performance across user groups, and thus \emph{reliability bias}. Finally, we show with an empirical case study that seemingly innocuous design choices such as the data sample rate, pre-processing parameters used to construct input features and pruning hyperparameters propagate \emph{reliability bias} through an audio keyword spotting development workflow. We leverage our insights to suggest strategies for developers to develop fairer on-device ML.
    KST-GCN: A Knowledge-Driven Spatial-Temporal Graph Convolutional Network for Traffic Forecasting. (arXiv:2011.14992v2 [cs.LG] UPDATED)
    (0 min) While considering the spatial and temporal features of traffic, capturing the impacts of various external factors on travel is an essential step towards achieving accurate traffic forecasting. However, existing studies seldom consider external factors or neglect the effect of the complex correlations among external factors on traffic. Intuitively, knowledge graphs can naturally describe these correlations. Since knowledge graphs and traffic networks are essentially heterogeneous networks, it is challenging to integrate the information in both networks. On this background, this study presents a knowledge representation-driven traffic forecasting method based on spatial-temporal graph convolutional networks. We first construct a knowledge graph for traffic forecasting and derive knowledge representations by a knowledge representation learning method named KR-EAR. Then, we propose the Knowledge Fusion Cell (KF-Cell) to combine the knowledge and traffic features as the input of a spatial-temporal graph convolutional backbone network. Experimental results on the real-world dataset show that our strategy enhances the forecasting performances of backbones at various prediction horizons. The ablation and perturbation analysis further verify the effectiveness and robustness of the proposed method. To the best of our knowledge, this is the first study that constructs and utilizes a knowledge graph to facilitate traffic forecasting; it also offers a promising direction to integrate external information and spatial-temporal information for traffic forecasting. The source code is available at https://github.com/lehaifeng/T-GCN/tree/master/KST-GCN.
    Cooperative Multi-Agent Fairness and Equivariant Policies. (arXiv:2106.05727v3 [cs.AI] UPDATED)
    (0 min) We study fairness through the lens of cooperative multi-agent learning. Our work is motivated by empirical evidence that naive maximization of team reward yields unfair outcomes for individual team members. To address fairness in multi-agent contexts, we introduce team fairness, a group-based fairness measure for multi-agent learning. We then prove that it is possible to enforce team fairness during policy optimization by transforming the team's joint policy into an equivariant map. We refer to our multi-agent learning strategy as Fairness through Equivariance (Fair-E) and demonstrate its effectiveness empirically. We then introduce Fairness through Equivariance Regularization (Fair-ER) as a soft-constraint version of Fair-E and show that it reaches higher levels of utility than Fair-E and fairer outcomes than non-equivariant policies. Finally, we present novel findings regarding the fairness-utility trade-off in multi-agent settings; showing that the magnitude of the trade-off is dependent on agent skill.
    EP-PQM: Efficient Parametric Probabilistic Quantum Memory with Fewer Qubits and Gates. (arXiv:2201.07265v1 [cs.ET])
    (0 min) Machine learning (ML) classification tasks can be carried out on a quantum computer (QC) using Probabilistic Quantum Memory (PQM) and its extension, Parameteric PQM (P-PQM) by calculating the Hamming distance between an input pattern and a database of $r$ patterns containing $z$ features with $a$ distinct attributes. For accurate computations, the feature must be encoded using one-hot encoding, which is memory-intensive for multi-attribute datasets with $a>2$. We can easily represent multi-attribute data more compactly on a classical computer by replacing one-hot encoding with label encoding. However, replacing these encoding schemes on a QC is not straightforward as PQM and P-PQM operate at the quantum bit level. We present an enhanced P-PQM, called EP-PQM, that allows label encoding of data stored in a PQM data structure and reduces the circuit depth of the data storage and retrieval procedures. We show implementations for an ideal QC and a noisy intermediate-scale quantum (NISQ) device. Our complexity analysis shows that the EP-PQM approach requires $O\left(z \log_2(a)\right)$ qubits as opposed to $O(za)$ qubits for P-PQM. EP-PQM also requires fewer gates, reducing gate count from $O\left(rza\right)$ to $O\left(rz\log_2(a)\right)$. For five datasets, we demonstrate that training an ML classification model using EP-PQM requires 48% to 77% fewer qubits than P-PQM for datasets with $a>2$. EP-PQM reduces circuit depth in the range of 60% to 96%, depending on the dataset. The depth decreases further with a decomposed circuit, ranging between 94% and 99%. EP-PQM requires less space; thus, it can train on and classify larger datasets than previous PQM implementations on NISQ devices. Furthermore, reducing the number of gates speeds up the classification and reduces the noise associated with deep quantum circuits. Thus, EP-PQM brings us closer to scalable ML on a NISQ device.
    Scattering GCN: Overcoming Oversmoothness in Graph Convolutional Networks. (arXiv:2003.08414v4 [cs.LG] UPDATED)
    (0 min) Graph convolutional networks (GCNs) have shown promising results in processing graph data by extracting structure-aware features. This gave rise to extensive work in geometric deep learning, focusing on designing network architectures that ensure neuron activations conform to regularity patterns within the input graph. However, in most cases the graph structure is only accounted for by considering the similarity of activations between adjacent nodes, which limits the capabilities of such methods to discriminate between nodes in a graph. Here, we propose to augment conventional GCNs with geometric scattering transforms and residual convolutions. The former enables band-pass filtering of graph signals, thus alleviating the so-called oversmoothing often encountered in GCNs, while the latter is introduced to clear the resulting features of high-frequency noise. We establish the advantages of the presented Scattering GCN with both theoretical results establishing the complementary benefits of scattering and GCN features, as well as experimental results showing the benefits of our method compared to leading graph neural networks for semi-supervised node classification, including the recently proposed GAT network that typically alleviates oversmoothing using graph attention mechanisms.
    Anytime Optimal PSRO for Two-Player Zero-Sum Games. (arXiv:2201.07700v1 [cs.GT])
    (2 min) Policy Space Response Oracles (PSRO) is a multi-agent reinforcement learning algorithm for games that can handle continuous actions and has empirically found approximate Nash equilibria in large games. PSRO is based on the tabular Double Oracle (DO) method, an algorithm that is guaranteed to converge to a Nash equilibrium, but may increase exploitability from one iteration to the next. We propose Anytime Optimal Double Oracle (AODO), a tabular double oracle algorithm for 2-player zero-sum games that is guaranteed to converge to a Nash equilibrium while decreasing exploitability from iteration to iteration. Unlike DO, in which the meta-strategy is based on the restricted game formed by each player's strategy sets, AODO finds the meta-strategy for each player that minimizes its exploitability against any policy in the full, unrestricted game. We also propose a method of finding this meta-strategy via a no-regret algorithm updated against a continually-trained best response, called RM-BR DO. Finally, we propose Anytime Optimal PSRO, a version of AODO that calculates best responses via reinforcement learning. In experiments on Leduc poker and random normal form games, we show that our methods achieve far lower exploitability than DO and PSRO and never increase exploitability.
    Bregman Deviations of Generic Exponential Families. (arXiv:2201.07306v1 [cs.LG])
    (0 min) We revisit the method of mixture technique, also known as the Laplace method, to study the concentration phenomenon in generic exponential families. Combining the properties of Bregman divergence associated with log-partition function of the family with the method of mixtures for super-martingales, we establish a generic bound controlling the Bregman divergence between the parameter of the family and a finite sample estimate of the parameter. Our bound is time-uniform and makes appear a quantity extending the classical \textit{information gain} to exponential families, which we call the \textit{Bregman information gain}. For the practitioner, we instantiate this novel bound to several classical families, e.g., Gaussian, Bernoulli, Exponential and Chi-square yielding explicit forms of the confidence sets and the Bregman information gain. We further numerically compare the resulting confidence bounds to state-of-the-art alternatives for time-uniform concentration and show that this novel method yields competitive results. Finally, we highlight how our results can be applied in a linear contextual multi-armed bandit problem.
    Uncertainty Quantification in Scientific Machine Learning: Methods, Metrics, and Comparisons. (arXiv:2201.07766v1 [cs.LG])
    (2 min) Neural networks (NNs) are currently changing the computational paradigm on how to combine data with mathematical laws in physics and engineering in a profound way, tackling challenging inverse and ill-posed problems not solvable with traditional methods. However, quantifying errors and uncertainties in NN-based inference is more complicated than in traditional methods. This is because in addition to aleatoric uncertainty associated with noisy data, there is also uncertainty due to limited data, but also due to NN hyperparameters, overparametrization, optimization and sampling errors as well as model misspecification. Although there are some recent works on uncertainty quantification (UQ) in NNs, there is no systematic investigation of suitable methods towards quantifying the total uncertainty effectively and efficiently even for function approximation, and there is even less work on solving partial differential equations and learning operator mappings between infinite-dimensional function spaces using NNs. In this work, we present a comprehensive framework that includes uncertainty modeling, new and existing solution methods, as well as evaluation metrics and post-hoc improvement approaches. To demonstrate the applicability and reliability of our framework, we present an extensive comparative study in which various methods are tested on prototype problems, including problems with mixed input-output data, and stochastic problems in high dimensions. In the Appendix, we include a comprehensive description of all the UQ methods employed, which we will make available as open-source library of all codes included in this framework.
    Interpretable Single-Cell Set Classification with Kernel Mean Embeddings. (arXiv:2201.07322v1 [cs.LG])
    (0 min) Modern single-cell flow and mass cytometry technologies measure the expression of several proteins of the individual cells within a blood or tissue sample. Each profiled biological sample is thus represented by a set of hundreds of thousands of multidimensional cell feature vectors, which incurs a high computational cost to predict each biological sample's associated phenotype with machine learning models. Such a large set cardinality also limits the interpretability of machine learning models due to the difficulty in tracking how each individual cell influences the ultimate prediction. Using Kernel Mean Embedding to encode the cellular landscape of each profiled biological sample, we can train a simple linear classifier and achieve state-of-the-art classification accuracy on 3 flow and mass cytometry datasets. Our model contains few parameters but still performs similarly to deep learning models with millions of parameters. In contrast with deep learning approaches, the linearity and sub-selection step of our model make it easy to interpret classification results. Clustering analysis further shows that our method admits rich biological interpretability for linking cellular heterogeneity to clinical phenotype.
    Debiased Graph Neural Networks with Agnostic Label Selection Bias. (arXiv:2201.07708v1 [cs.LG])
    (0 min) Most existing Graph Neural Networks (GNNs) are proposed without considering the selection bias in data, i.e., the inconsistent distribution between the training set with test set. In reality, the test data is not even available during the training process, making selection bias agnostic. Training GNNs with biased selected nodes leads to significant parameter estimation bias and greatly impacts the generalization ability on test nodes. In this paper, we first present an experimental investigation, which clearly shows that the selection bias drastically hinders the generalization ability of GNNs, and theoretically prove that the selection bias will cause the biased estimation on GNN parameters. Then to remove the bias in GNN estimation, we propose a novel Debiased Graph Neural Networks (DGNN) with a differentiated decorrelation regularizer. The differentiated decorrelation regularizer estimates a sample weight for each labeled node such that the spurious correlation of learned embeddings could be eliminated. We analyze the regularizer in causal view and it motivates us to differentiate the weights of the variables based on their contribution on the confounding bias. Then, these sample weights are used for reweighting GNNs to eliminate the estimation bias, thus help to improve the stability of prediction on unknown test nodes. Comprehensive experiments are conducted on several challenging graph datasets with two kinds of label selection biases. The results well verify that our proposed model outperforms the state-of-the-art methods and DGNN is a flexible framework to enhance existing GNNs.
    Privacy-Aware Human Mobility Prediction via Adversarial Networks. (arXiv:2201.07519v1 [cs.LG])
    (0 min) As various mobile devices and location-based services are increasingly developed in different smart city scenarios and applications, many unexpected privacy leakages have arisen due to geolocated data collection and sharing. While these geolocated data could provide a rich understanding of human mobility patterns and address various societal research questions, privacy concerns for users' sensitive information have limited their utilization. In this paper, we design and implement a novel LSTM-based adversarial mechanism with representation learning to attain a privacy-preserving feature representation of the original geolocated data (mobility data) for a sharing purpose. We quantify the utility-privacy trade-off of mobility datasets in terms of trajectory reconstruction risk, user re-identification risk, and mobility predictability. Our proposed architecture reports a Pareto Frontier analysis that enables the user to assess this trade-off as a function of Lagrangian loss weight parameters. The extensive comparison results on four representative mobility datasets demonstrate the superiority of our proposed architecture and the efficiency of the proposed privacy-preserving features extractor. Our results show that by exploring Pareto optimal setting, we can simultaneously increase both privacy (45%) and utility (32%).
    Batch versus Sequential Active Learning for Recommender Systems. (arXiv:2201.07571v1 [cs.IR])
    (0 min) Recommender systems have been investigated for many years, with the aim of generating the most accurate recommendations possible. However, available data about new users is often insufficient, leading to inaccurate recommendations; an issue that is known as the cold-start problem. A solution can be active learning. Active learning strategies proactively select items and ask users to rate these. This way, detailed user preferences can be acquired and as a result, more accurate recommendations can be offered to the user. In this study, we compare five active learning algorithms, combined with three different predictor algorithms, which are used to estimate to what extent the user would like the item that is asked to rate. In addition, two modes are tested for selecting the items: batch mode (all items at once), and sequential mode (the items one by one). Evaluation of the recommender in terms of rating prediction, decision support, and the ranking of items, showed that sequential mode produces the most accurate recommendations for dense data sets. Differences between the active learning algorithms are small. For most active learners, the best predictor turned out to be FunkSVD in combination with sequential mode.
    Layerwise Geo-Distributed Computing between Cloud and IoT. (arXiv:2201.07215v1 [cs.DC])
    (2 min) In this paper, we propose a novel architecture for a deep learning system, named k-degree layer-wise network, to realize efficient geo-distributed computing between Cloud and Internet of Things (IoT). The geo-distributed computing extends Cloud to the geographical verge of the network in the neighbor of IoT. The basic ideas of the proposal include a k-degree constraint and a layer-wise constraint. The k-degree constraint is defined such that the degree of each vertex on the h-th layer is exactly k(h) to extend the existing deep belief networks and control the communication cost. The layer-wise constraint is defined such that the layer-wise degrees are monotonically decreasing in positive direction to gradually reduce the dimension of data. We prove the k-degree layer-wise network is sparse, while a typical deep neural network is dense. In an evaluation on the M-distributed MNIST database, the proposal is superior to a state-of-the-art model in terms of communication cost and learning time with scalability.
    Waveform Learning for Next-Generation Wireless Communication Systems. (arXiv:2109.00998v2 [cs.IT] UPDATED)
    (2 min) We propose a learning-based method for the joint design of a transmit and receive filter, the constellation geometry and associated bit labeling, as well as a neural network (NN)-based detector. The method maximizes an achievable information rate, while simultaneously satisfying constraints on the adjacent channel leakage ratio (ACLR) and peak-to-average power ratio (PAPR). This allows control of the tradeoff between spectral containment, peak power, and communication rate. Evaluation on an additive white Gaussian noise (AWGN) channel shows significant reduction of ACLR and PAPR compared to a conventional baseline relying on quadrature amplitude modulation (QAM) and root-raised-cosine (RRC), without significant loss of information rate. When considering a 3rd Generation Partnership Project (3GPP) multipath channel, the learned waveform and neural receiver enable competitive or higher rates than an orthogonal frequency division multiplexing (OFDM) baseline, while reducing the ACLR by 10 dB and the PAPR by 2 dB. The proposed method incurs no additional complexity on the transmitter side and might be an attractive tool for waveform design of beyond-5G systems.
    Trajectory Design for UAV-Based Internet-of-Things Data Collection: A Deep Reinforcement Learning Approach. (arXiv:2107.11015v1 [cs.IT] CROSS LISTED)
    (2 min) In this paper, we investigate an unmanned aerial vehicle (UAV)-assisted Internet-of-Things (IoT) system in a sophisticated three-dimensional (3D) environment, where the UAV's trajectory is optimized to efficiently collect data from multiple IoT ground nodes. Unlike existing approaches focusing only on a simplified two-dimensional scenario and the availability of perfect channel state information (CSI), this paper considers a practical 3D urban environment with imperfect CSI, where the UAV's trajectory is designed to minimize data collection completion time subject to practical throughput and flight movement constraints. Specifically, inspired from the state-of-the-art deep reinforcement learning approaches, we leverage the twin-delayed deep deterministic policy gradient (TD3) to design the UAV's trajectory and present a TD3-based trajectory design for completion time minimization (TD3-TDCTM) algorithm. In particular, we set an additional information, i.e., the merged pheromone, to represent the state information of UAV and environment as a reference of reward which facilitates the algorithm design. By taking the service statuses of IoT nodes, the UAV's position, and the merged pheromone as input, the proposed algorithm can continuously and adaptively learn how to adjust the UAV's movement strategy. By interacting with the external environment in the corresponding Markov decision process, the proposed algorithm can achieve a near-optimal navigation strategy. Our simulation results show the superiority of the proposed TD3-TDCTM algorithm over three conventional non-learning based baseline methods.
    Understanding Data Storage and Ingestion for Large-Scale Deep Recommendation Model Training. (arXiv:2108.09373v2 [cs.DC] UPDATED)
    (2 min) Domain-specific accelerators (DSAs) are integrated in datacenter-scale clusters across industry to train increasingly-complex deep learning models over massive datasets. As innovations in DSAs continue to increase training efficiency and throughput, the data storage and ingestion (DSI) pipeline, the systems and hardware responsible for storing and preprocessing training data, will dominate and constrain training capacity. Similar innovation in DSI is urgent, demanding an in-depth understanding of DSI systems, infrastructure, and characteristics. To this end, this paper presents Meta's end-to-end DSI pipeline, composed of a central data warehouse built on distributed storage and a Data PreProcessing Service (DPP) that scales to eliminate data stalls. We characterize how hundreds of models are collaboratively trained across our global fleet, how massive and evolving datasets are stored and read, and how online preprocessing places intense demands on our underlying hardware. We synthesize key takeaways from our characterization and close with a discussion of lessons learned and research opportunities for both industry and academia.
    Input Selection for Bandwidth-Limited Neural Network Inference. (arXiv:1906.04673v2 [cs.LG] UPDATED)
    (2 min) Data are often accommodated on centralized storage servers. This is the case, for instance, in remote sensing and astronomy, where projects produce several petabytes of data every year. While machine learning models are often trained on relatively small subsets of the data, the inference phase typically requires transferring significant amounts of data between the servers and the clients. In many cases, the bandwidth available per user is limited, which then renders the data transfer to be one of the major bottlenecks. In this work, we propose a framework that automatically selects the relevant parts of the input data for a given neural network. The model as well as the associated selection masks are trained simultaneously such that a good model performance is achieved while only a minimal amount of data is selected. During the inference phase, only those parts of the data have to be transferred between the server and the client. We propose both instance-independent and instance-dependent selection masks. The former ones are the same for all instances to be transferred, whereas the latter ones allow for variable transfer sizes per instance. Our experiments show that it is often possible to significantly reduce the amount of data needed to be transferred without affecting the model quality much.
    Multiblock ADMM for nonsmooth nonconvex optimization with nonlinear coupling constraints. (arXiv:2201.07657v1 [math.OC])
    (2 min) This paper considers a multiblock nonsmooth nonconvex optimization problem with nonlinear coupling constraints. By developing the idea of using the information zone and adaptive regime proposed in [J. Bolte, S. Sabach and M. Teboulle, Nonconvex Lagrangian-based optimization: Monitoring schemes and global convergence, Mathematics of Operations Research, 43: 1210--1232, 2018], we propose a multiblock alternating direction method of multipliers for solving this problem. We specify the update of the primal variables by employing a majorization minimization procedure in each block update. An independent convergence analysis is conducted to prove subsequential as well as global convergence of the generated sequence to a critical point of the augmented Lagrangian. We also establish iteration complexity and provide preliminary numerical results for the proposed algorithm.
    Data-Driven Deep Learning Based Hybrid Beamforming for Aerial Massive MIMO-OFDM Systems with Implicit CSI. (arXiv:2201.06778v1 [eess.SP] CROSS LISTED)
    (2 min) In an aerial hybrid massive multiple-input multiple-output (MIMO) and orthogonal frequency division multiplexing (OFDM) system, how to design a spectral-efficient broadband multi-user hybrid beamforming with a limited pilot and feedback overhead is challenging. To this end, by modeling the key transmission modules as an end-to-end (E2E) neural network, this paper proposes a data-driven deep learning (DL)-based unified hybrid beamforming framework for both the time division duplex (TDD) and frequency division duplex (FDD) systems with implicit channel state information (CSI). For TDD systems, the proposed DL-based approach jointly models the uplink pilot combining and downlink hybrid beamforming modules as an E2E neural network. While for FDD systems, we jointly model the downlink pilot transmission, uplink CSI feedback, and downlink hybrid beamforming modules as an E2E neural network. Different from conventional approaches separately processing different modules, the proposed solution simultaneously optimizes all modules with the sum rate as the optimization object. Therefore, by perceiving the inherent property of air-to-ground massive MIMO-OFDM channel samples, the DL-based E2E neural network can establish the mapping function from the channel to the beamformer, so that the explicit channel reconstruction can be avoided with reduced pilot and feedback overhead. Besides, practical low-resolution phase shifters (PSs) introduce the quantization constraint, leading to the intractable gradient backpropagation when training the neural network. To mitigate the performance loss caused by the phase quantization error, we adopt the transfer learning strategy to further fine-tune the E2E neural network based on a pre-trained network that assumes the ideal infinite-resolution PSs. Numerical results show that our DL-based schemes have considerable advantages over state-of-the-art schemes.
    Stability of Deep Neural Networks via discrete rough paths. (arXiv:2201.07566v1 [cs.LG])
    (2 min) Using rough path techniques, we provide a priori estimates for the output of Deep Residual Neural Networks in terms of both the input data and the (trained) network weights. As trained network weights are typically very rough when seen as functions of the layer, we propose to derive stability bounds in terms of the total $p$-variation of trained weights for any $p\in[1,3]$. Unlike the $C^1$-theory underlying the neural ODE literature, our estimates remain bounded even in the limiting case of weights behaving like Brownian motions, as suggested in [arXiv:2105.12245]. Mathematically, we interpret residual neural network as solutions to (rough) difference equations, and analyse them based on recent results of discrete time signatures and rough path theory.
    Object Detection in Autonomous Vehicles: Status and Open Challenges. (arXiv:2201.07706v1 [cs.CV])
    (2 min) Object detection is a computer vision task that has become an integral part of many consumer applications today such as surveillance and security systems, mobile text recognition, and diagnosing diseases from MRI/CT scans. Object detection is also one of the critical components to support autonomous driving. Autonomous vehicles rely on the perception of their surroundings to ensure safe and robust driving performance. This perception system uses object detection algorithms to accurately determine objects such as pedestrians, vehicles, traffic signs, and barriers in the vehicle's vicinity. Deep learning-based object detectors play a vital role in finding and localizing these objects in real-time. This article discusses the state-of-the-art in object detectors and open challenges for their integration into autonomous vehicles.
    Uncovering More Shallow Heuristics: Probing the Natural Language Inference Capacities of Transformer-Based Pre-Trained Language Models Using Syllogistic Patterns. (arXiv:2201.07614v1 [cs.CL])
    (2 min) In this article, we explore the shallow heuristics used by transformer-based pre-trained language models (PLMs) that are fine-tuned for natural language inference (NLI). To do so, we construct or own dataset based on syllogistic, and we evaluate a number of models' performance on our dataset. We find evidence that the models rely heavily on certain shallow heuristics, picking up on symmetries and asymmetries between premise and hypothesis. We suggest that the lack of generalization observable in our study, which is becoming a topic of lively debate in the field, means that the PLMs are currently not learning NLI, but rather spurious heuristics.
    Motion Inbetweening via Deep $\Delta$-Interpolator. (arXiv:2201.06701v2 [cs.LG] UPDATED)
    (2 min) We show that the task of synthesizing missing middle frames, commonly known as motion inbetweening in the animation industry, can be solved more accurately and effectively if a deep learning interpolator operates in the delta mode, using the spherical linear interpolator as a baseline. We demonstrate our empirical findings on the publicly available LaFAN1 dataset. We further generalize this result by showing that the $\Delta$-regime is viable with respect to the reference of the last known frame (also known as the zero-velocity model). This supports the more general conclusion that deep inbetweening in the reference frame local to input frames is more accurate and robust than inbetweening in the global (world) reference frame advocated in previous work. Our code is publicly available at https://github.com/boreshkinai/delta-interpolator.
    Step-Ahead Error Feedback for Distributed Training with Compressed Gradient. (arXiv:2008.05823v2 [cs.LG] UPDATED)
    (2 min) Although the distributed machine learning methods can speed up the training of large deep neural networks, the communication cost has become the non-negligible bottleneck to constrain the performance. To address this challenge, the gradient compression based communication-efficient distributed learning methods were designed to reduce the communication cost, and more recently the local error feedback was incorporated to compensate for the corresponding performance loss. However, in this paper, we will show that a new "gradient mismatch" problem is raised by the local error feedback in centralized distributed training and can lead to degraded performance compared with full-precision training. To solve this critical problem, we propose two novel techniques, 1) step ahead and 2) error averaging, with rigorous theoretical analysis. Both our theoretical and empirical results show that our new methods can handle the "gradient mismatch" problem. The experimental results show that we can even train faster with common gradient compression schemes than both the full-precision training and local error feedback regarding the training epochs and without performance loss.
    Geometric Scattering Attention Networks. (arXiv:2010.15010v2 [cs.LG] UPDATED)
    (2 min) Geometric scattering has recently gained recognition in graph representation learning, and recent work has shown that integrating scattering features in graph convolution networks (GCNs) can alleviate the typical oversmoothing of features in node representation learning. However, scattering often relies on handcrafted design, requiring careful selection of frequency bands via a cascade of wavelet transforms, as well as an effective weight sharing scheme to combine low- and band-pass information. Here, we introduce a new attention-based architecture to produce adaptive task-driven node representations by implicitly learning node-wise weights for combining multiple scattering and GCN channels in the network. We show the resulting geometric scattering attention network (GSAN) outperforms previous networks in semi-supervised node classification, while also enabling a spectral study of extracted information by examining node-wise attention weights.
    Variational Autoencoder Generative Adversarial Network for Synthetic Data Generation in Smart Home. (arXiv:2201.07387v1 [cs.LG])
    (2 min) Data is the fuel of data science and machine learning techniques for smart grid applications, similar to many other fields. However, the availability of data can be an issue due to privacy concerns, data size, data quality, and so on. To this end, in this paper, we propose a Variational AutoEncoder Generative Adversarial Network (VAE-GAN) as a smart grid data generative model which is capable of learning various types of data distributions and generating plausible samples from the same distribution without performing any prior analysis on the data before the training phase.We compared the Kullback-Leibler (KL) divergence, maximum mean discrepancy (MMD), and Wasserstein distance between the synthetic data (electrical load and PV production) distribution generated by the proposed model, vanilla GAN network, and the real data distribution, to evaluate the performance of our model. Furthermore, we used five key statistical parameters to describe the smart grid data distribution and compared them between synthetic data generated by both models and real data. Experiments indicate that the proposed synthetic data generative model outperforms the vanilla GAN network. The distribution of VAE-GAN synthetic data is the most comparable to that of real data.
    NSGZero: Efficiently Learning Non-Exploitable Policy in Large-Scale Network Security Games with Neural Monte Carlo Tree Search. (arXiv:2201.07224v1 [cs.CR])
    (2 min) How resources are deployed to secure critical targets in networks can be modelled by Network Security Games (NSGs). While recent advances in deep learning (DL) provide a powerful approach to dealing with large-scale NSGs, DL methods such as NSG-NFSP suffer from the problem of data inefficiency. Furthermore, due to centralized control, they cannot scale to scenarios with a large number of resources. In this paper, we propose a novel DL-based method, NSGZero, to learn a non-exploitable policy in NSGs. NSGZero improves data efficiency by performing planning with neural Monte Carlo Tree Search (MCTS). Our main contributions are threefold. First, we design deep neural networks (DNNs) to perform neural MCTS in NSGs. Second, we enable neural MCTS with decentralized control, making NSGZero applicable to NSGs with many resources. Third, we provide an efficient learning paradigm, to achieve joint training of the DNNs in NSGZero. Compared to state-of-the-art algorithms, our method achieves significantly better data efficiency and scalability.
    Models for information propagation on graphs. (arXiv:2201.07577v1 [math.NA])
    (2 min) In this work we propose and unify classes of different models for information propagation over graphs. In a first class, propagation is modeled as a wave which emanates from a set of known nodes at an initial time, to all other unknown nodes at later times with an ordering determined by the time at which the information wave front reaches nodes. A second class of models is based on the notion of a travel time along paths between nodes. The time of information propagation from an initial known set of nodes to a node is defined as the minimum of a generalized travel time over subsets of all admissible paths. A final class is given by imposing a local equation of an eikonal form at each unknown node, with boundary conditions at the known nodes. The solution value of the local equation at a node is coupled the neighbouring nodes with smaller solution values. We provide precise formulations of the model classes in this graph setting, and prove equivalences between them. Motivated by the connection between first arrival time model and the eikonal equation in the continuum setting, we demonstrate that for graphs in the particular form of grids in Euclidean space mean field limits under grid refinement of certain graph models lead to Hamilton-Jacobi PDEs. For a specific parameter setting, we demonstrate that the solution on the grid approximates the Euclidean distance.
    Lipschitzness Is All You Need To Tame Off-policy Generative Adversarial Imitation Learning. (arXiv:2006.16785v3 [cs.LG] UPDATED)
    (2 min) Despite the recent success of reinforcement learning in various domains, these approaches remain, for the most part, deterringly sensitive to hyper-parameters and are often riddled with essential engineering feats allowing their success. We consider the case of off-policy generative adversarial imitation learning, and perform an in-depth review, qualitative and quantitative, of the method. We show that forcing the learned reward function to be local Lipschitz-continuous is a sine qua non condition for the method to perform well. We then study the effects of this necessary condition and provide several theoretical results involving the local Lipschitzness of the state-value function. We complement these guarantees with empirical evidence attesting to the strong positive effect that the consistent satisfaction of the Lipschitzness constraint on the reward has on imitation performance. Finally, we tackle a generic pessimistic reward preconditioning add-on spawning a large class of reward shaping methods, which makes the base method it is plugged into provably more robust, as shown in several additional theoretical guarantees. We then discuss these through a fine-grained lens and share our insights. Crucially, the guarantees derived and reported in this work are valid for any reward satisfying the Lipschitzness condition, nothing is specific to imitation. As such, these may be of independent interest.
    Enhancing the Security & Privacy of Wearable Brain-Computer Interfaces. (arXiv:2201.07711v1 [cs.CR])
    (2 min) Brain computing interfaces (BCI) are used in a plethora of safety/privacy-critical applications, ranging from healthcare to smart communication and control. Wearable BCI setups typically involve a head-mounted sensor connected to a mobile device, combined with ML-based data processing. Consequently, they are susceptible to a multiplicity of attacks across the hardware, software, and networking stacks used that can leak users' brainwave data or at worst relinquish control of BCI-assisted devices to remote attackers. In this paper, we: (i) analyse the whole-system security and privacy threats to existing wearable BCI products from an operating system and adversarial machine learning perspective; and (ii) introduce Argus, the first information flow control system for wearable BCI applications that mitigates these attacks. Argus' domain-specific design leads to a lightweight implementation on Linux ARM platforms suitable for existing BCI use-cases. Our proof of concept attacks on real-world BCI devices (Muse, NeuroSky, and OpenBCI) led us to discover more than 300 vulnerabilities across the stacks of six major attack vectors. Our evaluation shows Argus is highly effective in tracking sensitive dataflows and restricting these attacks with an acceptable memory and performance overhead (<15%).
    Online Deep Learning based on Auto-Encoder. (arXiv:2201.07383v1 [cs.LG])
    (2 min) Online learning is an important technical means for sketching massive real-time and high-speed data. Although this direction has attracted intensive attention, most of the literature in this area ignore the following three issues: (1) they think little of the underlying abstract hierarchical latent information existing in examples, even if extracting these abstract hierarchical latent representations is useful to better predict the class labels of examples; (2) the idea of preassigned model on unseen datapoints is not suitable for modeling streaming data with evolving probability distribution. This challenge is referred as model flexibility. And so, with this in minds, the online deep learning model we need to design should have a variable underlying structure; (3) moreover, it is of utmost importance to fusion these abstract hierarchical latent representations to achieve better classification performance, and we should give different weights to different levels of implicit representation information when dealing with the data streaming where the data distribution changes. To address these issues, we propose a two-phase Online Deep Learning based on Auto-Encoder (ODLAE). Based on auto-encoder, considering reconstruction loss, we extract abstract hierarchical latent representations of instances; Based on predictive loss, we devise two fusion strategies: the output-level fusion strategy, which is obtained by fusing the classification results of encoder each hidden layer; and feature-level fusion strategy, which is leveraged self-attention mechanism to fusion every hidden layer output. Finally, in order to improve the robustness of the algorithm, we also try to utilize the denoising auto-encoder to yield hierarchical latent representations. Experimental results on different datasets are presented to verify the validity of our proposed algorithm (ODLAE) outperforms several baselines.
    Dual Parameterization of Sparse Variational Gaussian Processes. (arXiv:2111.03412v2 [cs.LG] UPDATED)
    (2 min) Sparse variational Gaussian process (SVGP) methods are a common choice for non-conjugate Gaussian process inference because of their computational benefits. In this paper, we improve their computational efficiency by using a dual parameterization where each data example is assigned dual parameters, similarly to site parameters used in expectation propagation. Our dual parameterization speeds-up inference using natural gradient descent, and provides a tighter evidence lower bound for hyperparameter learning. The approach has the same memory cost as the current SVGP methods, but it is faster and more accurate.
    Bayesian Neural Hawkes Process for Event Uncertainty Prediction. (arXiv:2112.14474v2 [cs.LG] UPDATED)
    (2 min) Event data consisting of time of occurrence of the events arises in several real-world applications. Recent works have introduced neural network based point processes for modeling event-times, and were shown to provide state-of-the-art performance in predicting event-times. However, neural point process models lack a good uncertainty quantification capability on predictions. A proper uncertainty quantification over event modeling will help in better decision making for many practical applications. Therefore, we propose a novel point process model, Bayesian Neural Hawkes process (BNHP) which leverages uncertainty modelling capability of Bayesian models and generalization capability of the neural networks to model event occurrence times. We augment the model with spatio-temporal modeling capability where it can consider uncertainty over predicted time and location of the events. Experiments on simulated and real-world datasets show that BNHP significantly improves prediction performance and uncertainty quantification for modelling events.
    Semantic-Aware Implicit Neural Audio-Driven Video Portrait Generation. (arXiv:2201.07786v1 [cs.GR])
    (2 min) Animating high-fidelity video portrait with speech audio is crucial for virtual reality and digital entertainment. While most previous studies rely on accurate explicit structural information, recent works explore the implicit scene representation of Neural Radiance Fields (NeRF) for realistic generation. In order to capture the inconsistent motions as well as the semantic difference between human head and torso, some work models them via two individual sets of NeRF, leading to unnatural results. In this work, we propose Semantic-aware Speaking Portrait NeRF (SSP-NeRF), which creates delicate audio-driven portraits using one unified set of NeRF. The proposed model can handle the detailed local facial semantics and the global head-torso relationship through two semantic-aware modules. Specifically, we first propose a Semantic-Aware Dynamic Ray Sampling module with an additional parsing branch that facilitates audio-driven volume rendering. Moreover, to enable portrait rendering in one unified neural radiance field, a Torso Deformation module is designed to stabilize the large-scale non-rigid torso motions. Extensive evaluations demonstrate that our proposed approach renders more realistic video portraits compared to previous methods. Project page: https://alvinliu0.github.io/projects/SSP-NeRF
    Convergence of policy gradient for entropy regularized MDPs with neural network approximation in the mean-field regime. (arXiv:2201.07296v1 [math.OC])
    (2 min) We study the global convergence of policy gradient for infinite-horizon, continuous state and action space, entropy-regularized Markov decision processes (MDPs). We consider a softmax policy with (one-hidden layer) neural network approximation in a mean-field regime. Additional entropic regularization in the associated mean-field probability measure is added, and the corresponding gradient flow is studied in the 2-Wasserstein metric. We show that the objective function is increasing along the gradient flow. Further, we prove that if the regularization in terms of the mean-field measure is sufficient, the gradient flow converges exponentially fast to the unique stationary solution, which is the unique maximizer of the regularized MDP objective. Lastly, we study the sensitivity of the value function along the gradient flow with respect to regularization parameters and the initial condition. Our results rely on the careful analysis of non-linear Fokker--Planck--Kolmogorov equation and extend the pioneering work of Mei et al. 2020 and Agarwal et al. 2020, which quantify the global convergence rate of policy gradient for entropy-regularized MDPs in the tabular setting.
    Learning grammar with a divide-and-concur neural network. (arXiv:2201.07341v1 [cs.CL])
    (2 min) We implement a divide-and-concur iterative projection approach to context-free grammar inference. Unlike most state-of-the-art models of natural language processing, our method requires a relatively small number of discrete parameters, making the inferred grammar directly interpretable -- one can read off from a solution how to construct grammatically valid sentences. Another advantage of our approach is the ability to infer meaningful grammatical rules from just a few sentences, compared to the hundreds of gigabytes of training data many other models employ. We demonstrate several ways of applying our approach: classifying words and inferring a grammar from scratch, taking an existing grammar and refining its categories and rules, and taking an existing grammar and expanding its lexicon as it encounters new words in new data.
    On the Complexity of a Practical Primal-Dual Coordinate Method. (arXiv:2201.07684v1 [math.OC])
    (2 min) We prove complexity bounds for the primal-dual algorithm with random extrapolation and coordinate descent (PURE-CD), which has been shown to obtain good practical performance for solving convex-concave min-max problems with bilinear coupling. Our complexity bounds either match or improve the best-known results in the literature for both dense and sparse (strongly)-convex-(strongly)-concave problems.
    Efficient Training of Spiking Neural Networks with Temporally-Truncated Local Backpropagation through Time. (arXiv:2201.07210v1 [cs.NE])
    (2 min) Directly training spiking neural networks (SNNs) has remained challenging due to complex neural dynamics and intrinsic non-differentiability in firing functions. The well-known backpropagation through time (BPTT) algorithm proposed to train SNNs suffers from large memory footprint and prohibits backward and update unlocking, making it impossible to exploit the potential of locally-supervised training methods. This work proposes an efficient and direct training algorithm for SNNs that integrates a locally-supervised training method with a temporally-truncated BPTT algorithm. The proposed algorithm explores both temporal and spatial locality in BPTT and contributes to significant reduction in computational cost including GPU memory utilization, main memory access and arithmetic operations. We thoroughly explore the design space concerning temporal truncation length and local training block size and benchmark their impact on classification accuracy of different networks running different types of tasks. The results reveal that temporal truncation has a negative effect on the accuracy of classifying frame-based datasets, but leads to improvement in accuracy on dynamic-vision-sensor (DVS) recorded datasets. In spite of resulting information loss, local training is capable of alleviating overfitting. The combined effect of temporal truncation and local training can lead to the slowdown of accuracy drop and even improvement in accuracy. In addition, training deep SNNs models such as AlexNet classifying CIFAR10-DVS dataset leads to 7.26% increase in accuracy, 89.94% reduction in GPU memory, 10.79% reduction in memory access, and 99.64% reduction in MAC operations compared to the standard end-to-end BPTT.
    Learning to Rank For Push Notifications Using Pairwise Expected Regret. (arXiv:2201.07681v1 [cs.IR])
    (2 min) Listwise ranking losses have been widely studied in recommender systems. However, new paradigms of content consumption present new challenges for ranking methods. In this work we contribute an analysis of learning to rank for personalized mobile push notifications and discuss the unique challenges this presents compared to traditional ranking problems. To address these challenges, we introduce a novel ranking loss based on weighting the pairwise loss between candidates by the expected regret incurred for misordering the pair. We demonstrate that the proposed method can outperform prior methods both in a simulated environment and in a production experiment on a major social network.
    Explainable Ensemble Machine Learning for Breast Cancer Diagnosis based on Ultrasound Image Texture Features. (arXiv:2201.07227v1 [eess.IV])
    (2 min) Image classification is widely used to build predictive models for breast cancer diagnosis. Most existing approaches overwhelmingly rely on deep convolutional networks to build such diagnosis pipelines. These model architectures, although remarkable in performance, are black-box systems that provide minimal insight into the inner logic behind their predictions. This is a major drawback as the explainability of prediction is vital for applications such as cancer diagnosis. In this paper, we address this issue by proposing an explainable machine learning pipeline for breast cancer diagnosis based on ultrasound images. We extract first- and second-order texture features of the ultrasound images and use them to build a probabilistic ensemble of decision tree classifiers. Each decision tree learns to classify the input ultrasound image by learning a set of robust decision thresholds for texture features of the image. The decision path of the model predictions can then be interpreted by decomposing the learned decision trees. Our results show that our proposed framework achieves high predictive performance while being explainable.
    Malware Classification Using Static Disassembly and Machine Learning. (arXiv:2201.07649v1 [cs.CR])
    (2 min) Network and system security are incredibly critical issues now. Due to the rapid proliferation of malware, traditional analysis methods struggle with enormous samples. In this paper, we propose four easy-to-extract and small-scale features, including sizes and permissions of Windows PE sections, content complexity, and import libraries, to classify malware families, and use automatic machine learning to search for the best model and hyper-parameters for each feature and their combinations. Compared with detailed behavior-related features like API sequences, proposed features provide macroscopic information about malware. The analysis is based on static disassembly scripts and hexadecimal machine code. Unlike dynamic behavior analysis, static analysis is resource-efficient and offers complete code coverage, but is vulnerable to code obfuscation and encryption. The results demonstrate that features which work well in dynamic analysis are not necessarily effective when applied to static analysis. For instance, API 4-grams only achieve 57.96% accuracy and involve a relatively high dimensional feature set (5000 dimensions). In contrast, the novel proposed features together with a classical machine learning algorithm (Random Forest) presents very good accuracy at 99.40% and the feature vector is of much smaller dimension (40 dimensions). We demonstrate the effectiveness of this approach through integration in IDA Pro, which also facilitates the collection of new training samples and subsequent model retraining.
    ConDor: Self-Supervised Canonicalization of 3D Pose for Partial Shapes. (arXiv:2201.07788v1 [cs.CV])
    (2 min) Progress in 3D object understanding has relied on manually canonicalized shape datasets that contain instances with consistent position and orientation (3D pose). This has made it hard to generalize these methods to in-the-wild shapes, eg., from internet model collections or depth sensors. ConDor is a self-supervised method that learns to Canonicalize the 3D orientation and position for full and partial 3D point clouds. We build on top of Tensor Field Networks (TFNs), a class of permutation- and rotation-equivariant, and translation-invariant 3D networks. During inference, our method takes an unseen full or partial 3D point cloud at an arbitrary pose and outputs an equivariant canonical pose. During training, this network uses self-supervision losses to learn the canonical pose from an un-canonicalized collection of full and partial 3D point clouds. ConDor can also learn to consistently co-segment object parts without any supervision. Extensive quantitative results on four new metrics show that our approach outperforms existing methods while enabling new applications such as operation on depth images and annotation transfer.
  • stat.ML updates on arXiv.org

    Reconstructing Cosmic Polarization Rotation with ResUNet-CMB. (arXiv:2109.09715v2 [astro-ph.CO] UPDATED)
    (2 min) Cosmic polarization rotation, which may result from parity-violating new physics or the presence of primordial magnetic fields, converts $E$-mode polarization of the cosmic microwave background (CMB) into $B$-mode polarization. Anisotropic cosmic polarization rotation leads to statistical anisotropy in CMB polarization and can be reconstructed with quadratic estimator techniques similar to those designed for gravitational lensing of the CMB. At the sensitivity of upcoming CMB surveys, lensing-induced $B$-mode polarization will act as a limiting factor in the search for anisotropic cosmic polarization rotation, meaning that an analysis which incorporates some form of delensing will be required to improve constraints on the effect with future surveys. In this paper we extend the ResUNet-CMB convolutional neural network to reconstruct anisotropic cosmic polarization rotation in the presence of gravitational lensing and patchy reionization, and we show that the network simultaneously reconstructs all three effects with variance that is lower than that from the standard quadratic estimator nearly matching the performance of an iterative reconstruction method.
    Modelling and simulating spatial extremes by combining extreme value theory with generative adversarial networks. (arXiv:2111.00267v2 [stat.AP] UPDATED)
    (2 min) Modelling dependencies between climate extremes is important for climate risk assessment, for instance when allocating emergency management funds. In statistics, multivariate extreme value theory is often used to model spatial extremes. However, most commonly used approaches require strong assumptions and are either too simplistic or over-parameterized. From a machine learning perspective, Generative Adversarial Networks (GANs) are a powerful tool to model dependencies in high-dimensional spaces. Yet in the standard setting, GANs do not well represent dependencies in the extremes. Here we combine GANs with extreme value theory (evtGAN) to model spatial dependencies in summer maxima of temperature and winter maxima in precipitation over a large part of western Europe. We use data from a stationary 2000-year climate model simulation to validate the approach and explore its sensitivity to small sample sizes. Our results show that evtGAN outperforms classical GANs and standard statistical approaches to model spatial extremes. Already with about 50 years of data, which corresponds to commonly available climate records, we obtain reasonably good performance. In general, dependencies between temperature extremes are better captured than dependencies between precipitation extremes due to the high spatial coherence in temperature fields. Our approach can be applied to other climate variables and can be used to emulate climate models when running very long simulations to determine dependencies in the extremes is deemed infeasible.
    Linear Polytree Structural Equation Models: Structural Learning and Inverse Correlation Estimation. (arXiv:2107.10955v2 [stat.ML] UPDATED)
    (2 min) We are interested in the problem of learning the directed acyclic graph (DAG) when data are generated from a linear structural equation model (SEM) and the causal structure can be characterized by a polytree. Under the Gaussian polytree models, we study sufficient conditions on the sample sizes for the well-known Chow-Liu algorithm to exactly recover both the skeleton and the equivalence class of the polytree, which is uniquely represented by a CPDAG. On the other hand, necessary conditions on the required sample sizes for both skeleton and CPDAG recovery are also derived in terms of information-theoretic lower bounds, which match the respective sufficient conditions and thereby give a sharp characterization of the difficulty of these tasks. We also consider extensions to the sub-Gaussian case, and then study the estimation of the inverse correlation matrix under such models. Our theoretical findings are illustrated by comprehensive numerical simulations, and experiments on benchmark data also demonstrate the robustness of polytree learning when the true graphical structures can only be approximated by polytrees.
    Generalized XGBoost Method. (arXiv:2109.07473v2 [cs.LG] UPDATED)
    (2 min) The XGBoost method has many advantages and is especially suitable for statistical analysis of big data, but its loss function is limited to convex functions. In many specific applications, a nonconvex loss function would be preferable. In this paper, I propose a generalized XGBoost method, which requires weaker loss function constraint and involves more general loss functions, including convex loss functions and some non-convex loss functions. Furthermore, this generalized XGBoost method is extended to multivariate loss function to form a more generalized XGBoost method. This method is a multiobjective parameter regularized tree boosting method, which can model multiple parameters in most of the frequently-used parametric probability distributions to be fitted by predictor variables. Meanwhile, the related algorithms and some examples in non-life insurance pricing are given.
    Stein Variational Gaussian Processes. (arXiv:2009.12141v3 [stat.ML] UPDATED)
    (2 min) We show how to use Stein variational gradient descent (SVGD) to carry out inference in Gaussian process (GP) models with non-Gaussian likelihoods and large data volumes. Markov chain Monte Carlo (MCMC) is extremely computationally intensive for these situations, but the parametric assumptions required for efficient variational inference (VI) result in incorrect inference when they encounter the multi-modal posterior distributions that are common for such models. SVGD provides a non-parametric alternative to variational inference which is substantially faster than MCMC. We prove that for GP models with Lipschitz gradients the SVGD algorithm monotonically decreases the Kullback-Leibler divergence from the sampling distribution to the true posterior. Our method is demonstrated on benchmark problems in both regression and classification, a multimodal posterior, and an air quality example with 550,134 spatiotemporal observations, showing substantial performance improvements over MCMC and VI.
    Step-Ahead Error Feedback for Distributed Training with Compressed Gradient. (arXiv:2008.05823v2 [cs.LG] UPDATED)
    (2 min) Although the distributed machine learning methods can speed up the training of large deep neural networks, the communication cost has become the non-negligible bottleneck to constrain the performance. To address this challenge, the gradient compression based communication-efficient distributed learning methods were designed to reduce the communication cost, and more recently the local error feedback was incorporated to compensate for the corresponding performance loss. However, in this paper, we will show that a new "gradient mismatch" problem is raised by the local error feedback in centralized distributed training and can lead to degraded performance compared with full-precision training. To solve this critical problem, we propose two novel techniques, 1) step ahead and 2) error averaging, with rigorous theoretical analysis. Both our theoretical and empirical results show that our new methods can handle the "gradient mismatch" problem. The experimental results show that we can even train faster with common gradient compression schemes than both the full-precision training and local error feedback regarding the training epochs and without performance loss.
    Input Selection for Bandwidth-Limited Neural Network Inference. (arXiv:1906.04673v2 [cs.LG] UPDATED)
    (2 min) Data are often accommodated on centralized storage servers. This is the case, for instance, in remote sensing and astronomy, where projects produce several petabytes of data every year. While machine learning models are often trained on relatively small subsets of the data, the inference phase typically requires transferring significant amounts of data between the servers and the clients. In many cases, the bandwidth available per user is limited, which then renders the data transfer to be one of the major bottlenecks. In this work, we propose a framework that automatically selects the relevant parts of the input data for a given neural network. The model as well as the associated selection masks are trained simultaneously such that a good model performance is achieved while only a minimal amount of data is selected. During the inference phase, only those parts of the data have to be transferred between the server and the client. We propose both instance-independent and instance-dependent selection masks. The former ones are the same for all instances to be transferred, whereas the latter ones allow for variable transfer sizes per instance. Our experiments show that it is often possible to significantly reduce the amount of data needed to be transferred without affecting the model quality much.
    Learning Interpretable Models Using an Oracle. (arXiv:1906.06852v3 [cs.LG] UPDATED)
    (2 min) As Machine Learning (ML) becomes pervasive in various real world systems, the need for models to be understandable has increased. We focus on interpretability, noting that models often need to be constrained in size for them to be considered interpretable, e.g., a decision tree of depth 5 is easier to interpret than one of depth 50. But smaller models also tend to have high bias. This suggests a trade-off between interpretability and accuracy. We propose a model agnostic technique to minimize this trade-off. Our strategy is to first learn a powerful, possibly black-box, probabilistic model -- referred to as the oracle -- on the training data. Uncertainty in the oracle's predictions are used to learn a sampling distribution for the training data. The interpretable model is trained on a sample obtained using this distribution. We demonstrate that such a model often is significantly more accurate than one trained on the original data. Determining the sampling strategy is formulated as an optimization problem. Our solution to this problem possesses the following key favorable properties: (1) the number of optimization variables is independent of the dimensionality of the data: a fixed number of seven variables are used (2) our technique is model agnostic - in that both the interpretable model and the oracle may belong to arbitrary model families. Results using multiple real world datasets, using Linear Probability Models and Decision Trees as interpretable models, with Gradient Boosted Model and Random Forest as oracles, are presented. We observe significant relative improvements in the F1-score in most cases, occasionally seeing improvements greater than 100%. Additionally, we discuss an interesting application of our technique where a Gated Recurrent Unit network is used to improve the sequence classification accuracy of a Decision Tree that uses character n-grams as features.
    Machine Learning Calabi-Yau Hypersurfaces. (arXiv:2112.06350v2 [hep-th] UPDATED)
    (2 min) We revisit the classic database of weighted-P4s which admit Calabi-Yau 3-fold hypersurfaces equipped with a diverse set of tools from the machine-learning toolbox. Unsupervised techniques identify an unanticipated almost linear dependence of the topological data on the weights. This then allows us to identify a previously unnoticed clustering in the Calabi-Yau data. Supervised techniques are successful in predicting the topological parameters of the hypersurface from its weights with an accuracy of R^2 > 95%. Supervised learning also allows us to identify weighted-P4s which admit Calabi-Yau hypersurfaces to 100% accuracy by making use of partitioning supported by the clustering behaviour.
    A Regularized Limited Memory BFGS method for Large-Scale Unconstrained Optimization and its Efficient Implementations. (arXiv:2101.04413v1 [math.OC] CROSS LISTED)
    (2 min) The limited memory BFGS (L-BFGS) method is one of the popular methods for solving large-scale unconstrained optimization. Since the standard L-BFGS method uses a line search to guarantee its global convergence, it sometimes requires a large number of function evaluations. To overcome the difficulty, we propose a new L-BFGS with a certain regularization technique. We show its global convergence under the usual assumptions. In order to make the method more robust and efficient, we also extend it with several techniques such as nonmonotone technique and simultaneous use of the Wolfe line search. Finally, we present some numerical results for test problems in CUTEst, which show that the proposed method is robust in terms of solving number of problems.
    Coupled Support Tensor Machine Classification for Multimodal Neuroimaging Data. (arXiv:2201.07683v1 [stat.ML])
    (2 min) Multimodal data arise in various applications where information about the same phenomenon is acquired from multiple sensors and across different imaging modalities. Learning from multimodal data is of great interest in machine learning and statistics research as this offers the possibility of capturing complementary information among modalities. Multimodal modeling helps to explain the interdependence between heterogeneous data sources, discovers new insights that may not be available from a single modality, and improves decision-making. Recently, coupled matrix-tensor factorization has been introduced for multimodal data fusion to jointly estimate latent factors and identify complex interdependence among the latent factors. However, most of the prior work on coupled matrix-tensor factors focuses on unsupervised learning and there is little work on supervised learning using the jointly estimated latent factors. This paper considers the multimodal tensor data classification problem. A Coupled Support Tensor Machine (C-STM) built upon the latent factors jointly estimated from the Advanced Coupled Matrix Tensor Factorization (ACMTF) is proposed. C-STM combines individual and shared latent factors with multiple kernels and estimates a maximal-margin classifier for coupled matrix tensor data. The classification risk of C-STM is shown to converge to the optimal Bayes risk, making it a statistically consistent rule. C-STM is validated through simulation studies as well as a simultaneous EEG-fMRI analysis. The empirical evidence shows that C-STM can utilize information from multiple sources and provide a better classification performance than traditional single-mode classifiers.
    Scattering GCN: Overcoming Oversmoothness in Graph Convolutional Networks. (arXiv:2003.08414v4 [cs.LG] UPDATED)
    (2 min) Graph convolutional networks (GCNs) have shown promising results in processing graph data by extracting structure-aware features. This gave rise to extensive work in geometric deep learning, focusing on designing network architectures that ensure neuron activations conform to regularity patterns within the input graph. However, in most cases the graph structure is only accounted for by considering the similarity of activations between adjacent nodes, which limits the capabilities of such methods to discriminate between nodes in a graph. Here, we propose to augment conventional GCNs with geometric scattering transforms and residual convolutions. The former enables band-pass filtering of graph signals, thus alleviating the so-called oversmoothing often encountered in GCNs, while the latter is introduced to clear the resulting features of high-frequency noise. We establish the advantages of the presented Scattering GCN with both theoretical results establishing the complementary benefits of scattering and GCN features, as well as experimental results showing the benefits of our method compared to leading graph neural networks for semi-supervised node classification, including the recently proposed GAT network that typically alleviates oversmoothing using graph attention mechanisms.
    On the Complexity of a Practical Primal-Dual Coordinate Method. (arXiv:2201.07684v1 [math.OC])
    (2 min) We prove complexity bounds for the primal-dual algorithm with random extrapolation and coordinate descent (PURE-CD), which has been shown to obtain good practical performance for solving convex-concave min-max problems with bilinear coupling. Our complexity bounds either match or improve the best-known results in the literature for both dense and sparse (strongly)-convex-(strongly)-concave problems.
    RAMANMETRIX: a delightful way to analyze Raman spectra. (arXiv:2201.07586v1 [physics.data-an])
    (2 min) Although Raman spectroscopy is widely used for the investigation of biomedical samples and has a high potential for use in clinical applications, it is not common in clinical routines. One of the factors that obstruct the integration of Raman spectroscopic tools into clinical routines is the complexity of the data processing workflow. Software tools that simplify spectroscopic data handling may facilitate such integration by familiarizing clinical experts with the advantages of Raman spectroscopy. Here, RAMANMETRIX is introduced as a user-friendly software with an intuitive web-based graphical user interface (GUI) that incorporates a complete workflow for chemometric analysis of Raman spectra, from raw data pretreatment to a robust validation of machine learning models. The software can be used both for model training and for the application of the pretrained models onto new data sets. Users have full control of the parameters during model training, but the testing data flow is frozen and does not require additional user input. RAMANMETRIX is available in two versions: as standalone software and web application. Due to the modern software architecture, the computational backend part can be executed separately from the GUI and accessed through an application programming interface (API) for applying a preconstructed model to the measured data. This opens up possibilities for using the software as a data processing backend for the measurement devices in real-time. The models preconstructed by more experienced users can be exported and reused for easy one-click data preprocessing and prediction, which requires minimal interaction between the user and the software. The results of such prediction and graphical outputs of the different data processing steps can be exported and saved.
    Learning Tensor Representations for Meta-Learning. (arXiv:2201.07348v1 [cs.LG])
    (2 min) We introduce a tensor-based model of shared representation for meta-learning from a diverse set of tasks. Prior works on learning linear representations for meta-learning assume that there is a common shared representation across different tasks, and do not consider the additional task-specific observable side information. In this work, we model the meta-parameter through an order-$3$ tensor, which can adapt to the observed task features of the task. We propose two methods to estimate the underlying tensor. The first method solves a tensor regression problem and works under natural assumptions on the data generating process. The second method uses the method of moments under additional distributional assumptions and has an improved sample complexity in terms of the number of tasks. We also focus on the meta-test phase, and consider estimating task-specific parameters on a new task. Substituting the estimated tensor from the first step allows us estimating the task-specific parameters with very few samples of the new task, thereby showing the benefits of learning tensor representations for meta-learning. Finally, through simulation and several real-world datasets, we evaluate our methods and show that it improves over previous linear models of shared representations for meta-learning.
    Self-Supervised Metric Learning in Multi-View Data: A Downstream Task Perspective. (arXiv:2106.07138v3 [stat.ML] UPDATED)
    (2 min) Self-supervised metric learning has been a successful approach for learning a distance from an unlabeled dataset. The resulting distance is broadly useful for improving various distance-based downstream tasks, even when no information from downstream tasks is utilized in the metric learning stage. To gain insights into this approach, we develop a statistical framework to theoretically study how self-supervised metric learning can benefit downstream tasks in the context of multi-view data. Under this framework, we show that the target distance of metric learning satisfies several desired properties for the downstream tasks. On the other hand, our investigation suggests the target distance can be further improved by moderating each direction's weights. In addition, our analysis precisely characterizes the improvement by self-supervised metric learning on four commonly used downstream tasks: sample identification, two-sample testing, $k$-means clustering, and $k$-nearest neighbor classification. When the distance is estimated from an unlabeled dataset, we establish the upper bound on distance estimation's accuracy and the number of samples sufficient for downstream task improvement. Finally, numerical experiments are presented to support the theoretical results in the paper.
    On Dynamic Pricing with Covariates. (arXiv:2112.13254v2 [cs.LG] UPDATED)
    (2 min) We consider the dynamic pricing problem with covariates under a generalized linear demand model: a seller can dynamically adjust the price of a product over a horizon of $T$ time periods, and at each time period $t$, the demand of the product is jointly determined by the price and an observable covariate vector $x_t\in\mathbb{R}^d$ through an unknown generalized linear model. Most of the existing literature assumes the covariate vectors $x_t$'s are independently and identically distributed (i.i.d.); the few papers that relax this assumption either sacrifice model generality or yield sub-optimal regret bounds. In this paper we show that a simple pricing algorithm has an $O(d\sqrt{T}\log T)$ regret upper bound without assuming any statistical structure on the covariates $x_t$ (which can even be arbitrarily chosen). The upper bound on the regret matches the lower bound (even under the i.i.d. assumption) up to logarithmic factors. Our paper thus shows that (i) the i.i.d. assumption is not necessary for obtaining low regret, and (ii) the regret bound can be independent of the (inverse) minimum eigenvalue of the covariance matrix of the $x_t$'s, a quantity present in previous bounds. Furthermore, we discuss a condition under which a better regret is achievable and how a Thompson sampling algorithm can be applied to give an efficient computation of the prices.
    Linear Convergence of the Subspace Constrained Mean Shift Algorithm: From Euclidean to Directional Data. (arXiv:2104.14977v2 [stat.ML] UPDATED)
    (2 min) This paper studies the linear convergence of the subspace constrained mean shift (SCMS) algorithm, a well-known algorithm for identifying a density ridge defined by a kernel density estimator. By arguing that the SCMS algorithm is a special variant of a subspace constrained gradient ascent (SCGA) algorithm with an adaptive step size, we derive the linear convergence of such SCGA algorithm. While the existing research focuses mainly on density ridges in the Euclidean space, we generalize density ridges and the SCMS algorithm to directional data. In particular, we establish the stability theorem of density ridges with directional data and prove the linear convergence of our proposed directional SCMS algorithm.
    Geometric Scattering Attention Networks. (arXiv:2010.15010v2 [cs.LG] UPDATED)
    (2 min) Geometric scattering has recently gained recognition in graph representation learning, and recent work has shown that integrating scattering features in graph convolution networks (GCNs) can alleviate the typical oversmoothing of features in node representation learning. However, scattering often relies on handcrafted design, requiring careful selection of frequency bands via a cascade of wavelet transforms, as well as an effective weight sharing scheme to combine low- and band-pass information. Here, we introduce a new attention-based architecture to produce adaptive task-driven node representations by implicitly learning node-wise weights for combining multiple scattering and GCN channels in the network. We show the resulting geometric scattering attention network (GSAN) outperforms previous networks in semi-supervised node classification, while also enabling a spectral study of extracted information by examining node-wise attention weights.
    Ordinal Causal Discovery. (arXiv:2201.07396v1 [stat.ME])
    (2 min) Causal discovery for purely observational, categorical data is a long-standing challenging problem. Unlike continuous data, the vast majority of existing methods for categorical data focus on inferring the Markov equivalence class only, which leaves the direction of some causal relationships undetermined. This paper proposes an identifiable ordinal causal discovery method that exploits the ordinal information contained in many real-world applications to uniquely identify the causal structure. Simple score-and-search algorithms are developed for structure learning. The proposed method is applicable beyond ordinal data via data discretization. Through real-world and synthetic experiments, we demonstrate that the proposed ordinal causal discovery method has favorable and robust performance compared to state-of-the-art alternative methods in both ordinal categorical and non-categorical data.
    On the Acceleration of Deep Learning Model Parallelism with Staleness. (arXiv:1909.02625v3 [cs.LG] UPDATED)
    (2 min) Training the deep convolutional neural network for computer vision problems is slow and inefficient, especially when it is large and distributed across multiple devices. The inefficiency is caused by the backpropagation algorithm's forward locking, backward locking, and update locking problems. Existing solutions for acceleration either can only handle one locking problem or lead to severe accuracy loss or memory inefficiency. Moreover, none of them consider the straggler problem among devices. In this paper, we propose Layer-wise Staleness and a novel efficient training algorithm, Diversely Stale Parameters (DSP), to address these challenges. We also analyze the convergence of DSP with two popular gradient-based methods and prove that both of them are guaranteed to converge to critical points for non-convex problems. Finally, extensive experimental results on training deep learning models demonstrate that our proposed DSP algorithm can achieve significant training speedup with stronger robustness than compared methods.
    Learning Inconsistent Preferences with Gaussian Processes. (arXiv:2006.03847v2 [stat.ML] UPDATED)
    (2 min) We revisit widely used preferential Gaussian processes by Chu et al.(2005) and challenge their modelling assumption that imposes rankability of data items via latent utility function values. We propose a generalisation of pgp which can capture more expressive latent preferential structures in the data and thus be used to model inconsistent preferences, i.e. where transitivity is violated, or to discover clusters of comparable items via spectral decomposition of the learned preference functions. We also consider the properties of associated covariance kernel functions and its reproducing kernel Hilbert Space (RKHS), giving a simple construction that satisfies universality in the space of preference functions. Finally, we provide an extensive set of numerical experiments on simulated and real-world datasets showcasing the competitiveness of our proposed method with state-of-the-art. Our experimental findings support the conjecture that violations of rankability are ubiquitous in real-world preferential data.
    Physics Informed Deep Kernel Learning. (arXiv:2006.04976v2 [stat.ML] UPDATED)
    (2 min) Deep kernel learning is a promising combination of deep neural networks and nonparametric function learning. However, as a data driven approach, the performance of deep kernel learning can still be restricted by scarce or insufficient data, especially in extrapolation tasks. To address these limitations, we propose Physics Informed Deep Kernel Learning (PI-DKL) that exploits physics knowledge represented by differential equations with latent sources. Specifically, we use the posterior function sample of the Gaussian process as the surrogate for the solution of the differential equation, and construct a generative component to integrate the equation in a principled Bayesian hybrid framework. For efficient and effective inference, we marginalize out the latent variables in the joint probability and derive a collapsed model evidence lower bound (ELBO), based on which we develop a stochastic model estimation algorithm. Our ELBO can be viewed as a nice, interpretable posterior regularization objective. On synthetic datasets and real-world applications, we show the advantage of our approach in both prediction accuracy and uncertainty quantification.
    Dual Parameterization of Sparse Variational Gaussian Processes. (arXiv:2111.03412v2 [cs.LG] UPDATED)
    (2 min) Sparse variational Gaussian process (SVGP) methods are a common choice for non-conjugate Gaussian process inference because of their computational benefits. In this paper, we improve their computational efficiency by using a dual parameterization where each data example is assigned dual parameters, similarly to site parameters used in expectation propagation. Our dual parameterization speeds-up inference using natural gradient descent, and provides a tighter evidence lower bound for hyperparameter learning. The approach has the same memory cost as the current SVGP methods, but it is faster and more accurate.
    Beta Shapley: a Unified and Noise-reduced Data Valuation Framework for Machine Learning. (arXiv:2110.14049v2 [cs.LG] UPDATED)
    (2 min) Data Shapley has recently been proposed as a principled framework to quantify the contribution of individual datum in machine learning. It can effectively identify helpful or harmful data points for a learning algorithm. In this paper, we propose Beta Shapley, which is a substantial generalization of Data Shapley. Beta Shapley arises naturally by relaxing the efficiency axiom of the Shapley value, which is not critical for machine learning settings. Beta Shapley unifies several popular data valuation methods and includes data Shapley as a special case. Moreover, we prove that Beta Shapley has several desirable statistical properties and propose efficient algorithms to estimate it. We demonstrate that Beta Shapley outperforms state-of-the-art data valuation methods on several downstream ML tasks such as: 1) detecting mislabeled training data; 2) learning with subsamples; and 3) identifying points whose addition or removal have the largest positive or negative impact on the model.
    Lifted Primal-Dual Method for Bilinearly Coupled Smooth Minimax Optimization. (arXiv:2201.07427v1 [math.OC])
    (2 min) We study the bilinearly coupled minimax problem: $\min_{x} \max_{y} f(x) + y^\top A x - h(y)$, where $f$ and $h$ are both strongly convex smooth functions and admit first-order gradient oracles. Surprisingly, no known first-order algorithms have hitherto achieved the lower complexity bound of $\Omega((\sqrt{\frac{L_x}{\mu_x}} + \frac{\|A\|}{\sqrt{\mu_x \mu_y}} + \sqrt{\frac{L_y}{\mu_y}}) \log(\frac1{\varepsilon}))$ for solving this problem up to an $\varepsilon$ primal-dual gap in the general parameter regime, where $L_x, L_y,\mu_x,\mu_y$ are the corresponding smoothness and strongly convexity constants. We close this gap by devising the first optimal algorithm, the Lifted Primal-Dual (LPD) method. Our method lifts the objective into an extended form that allows both the smooth terms and the bilinear term to be handled optimally and seamlessly with the same primal-dual framework. Besides optimality, our method yields a desirably simple single-loop algorithm that uses only one gradient oracle call per iteration. Moreover, when $f$ is just convex, the same algorithm applied to a smoothed objective achieves the nearly optimal iteration complexity. We also provide a direct single-loop algorithm, using the LPD method, that achieves the iteration complexity of $O(\sqrt{\frac{L_x}{\varepsilon}} + \frac{\|A\|}{\sqrt{\mu_y \varepsilon}} + \sqrt{\frac{L_y}{\varepsilon}})$. Numerical experiments on quadratic minimax problems and policy evaluation problems further demonstrate the fast convergence of our algorithm in practice.
    Using machine learning to parametrize postmerger signals from binary neutron stars. (arXiv:2201.06461v1 [gr-qc] CROSS LISTED)
    (2 min) There is growing interest in the detection and characterization of gravitational waves from postmerger oscillations of binary neutron stars. These signals contain information about the nature of the remnant and the high-density and out-of-equilibrium physics of the postmerger processes, which would complement any electromagnetic signal. However, the construction of binary neutron star postmerger waveforms is much more complicated than for binary black holes: (i) there are theoretical uncertainties in the neutron-star equation of state and other aspects of the high-density physics, (ii) numerical simulations are expensive and available ones only cover a small fraction of the parameter space with limited numerical accuracy, and (iii) it is unclear how to parametrize the theoretical uncertainties and interpolate across parameter space. In this work, we describe the use of a machine-learning method called a conditional variational autoencoder (CVAE) to construct postmerger models for hyper/massive neutron star remnant signals based on numerical-relativity simulations. The CVAE provides a probabilistic model, which encodes uncertainties in the training data within a set of latent parameters. We estimate that training such a model will ultimately require $\sim 10^4$ waveforms. However, using synthetic training waveforms as a proof-of-principle, we show that the CVAE can be used as an accurate generative model and that it encodes the equation of state in a useful latent representation.
    Deep Capsule Encoder-Decoder Network for Surrogate Modeling and Uncertainty Quantification. (arXiv:2201.07753v1 [stat.ML])
    (2 min) We propose a novel \textit{capsule} based deep encoder-decoder model for surrogate modeling and uncertainty quantification of systems in mechanics from sparse data. The proposed framework is developed by adapting Capsule Network (CapsNet) architecture into image-to-image regression encoder-decoder network. Specifically, the aim is to exploit the benefits of CapsNet over convolution neural network (CNN) $-$ retaining pose and position information related to an entity to name a few. The performance of proposed approach is illustrated by solving an elliptic stochastic partial differential equation (SPDE), which also governs systems in mechanics such as steady heat conduction, ground water flow or other diffusion processes, based uncertainty quantification problem with an input dimensionality of $1024$. However, the problem definition does not the restrict the random diffusion field to a particular covariance structure, and the more strenuous task of response prediction for an arbitrary diffusion field is solved. The obtained results from performance evaluation indicate that the proposed approach is accurate, efficient, and robust.
    The Elliptical Potential Lemma for General Distributions with an Application to Linear Thompson Sampling. (arXiv:2102.07987v3 [stat.ML] UPDATED)
    (2 min) In this note, we introduce a general version of the well-known elliptical potential lemma that is a widely used technique in the analysis of algorithms in sequential learning and decision-making problems. We consider a stochastic linear bandit setting where a decision-maker sequentially chooses among a set of given actions, observes their noisy rewards, and aims to maximize her cumulative expected reward over a decision-making horizon. The elliptical potential lemma is a key tool for quantifying uncertainty in estimating parameters of the reward function, but it requires the noise and the prior distributions to be Gaussian. Our general elliptical potential lemma relaxes this Gaussian requirement which is a highly non-trivial extension for a number of reasons; unlike the Gaussian case, there is no closed-form solution for the covariance matrix of the posterior distribution, the covariance matrix is not a deterministic function of the actions, and the covariance matrix is not decreasing with respect to the semidefinite inequality. While this result is of broad interest, we showcase an application of it to prove an improved Bayesian regret bound for the well-known Thompson sampling algorithm in stochastic linear bandits with changing action sets where prior and noise distributions are general. This bound is minimax optimal up to constants.
  • Google AI Blog

    LaMDA: Towards Safe, Grounded, and High-Quality Dialog Models for Everything
    (8 min) Posted by Heng-Tze Cheng, Senior Staff Software Engineer and Romal Thoppilan, Senior Software Engineer, Google Research, Brain Team Language models are becoming more capable than ever before and are helpful in a variety of tasks — translating one language into another, summarizing a long document into a brief highlight, or answering information-seeking questions. Among these, open-domain dialog, where a model needs to be able to converse about any topic, is probably one of the most difficult, with a wide range of potential applications and open challenges. In addition to producing responses that humans judge as sensible, interesting, and specific to the context, dialog models should adhere to Responsible AI practices, and avoid making factual statements that are not supported by external …
  • Data Science Central

    Digital Twin Technology: Reconciling the Physical and Digital
    (2 min) The digital twin concept is a rising digital profile of a procedure or a physical product speaking to its utilitarian and behavioral qualities utilized for execution enhancement. The innovation has empowered augmentation of the genuine and the virtual world through constant digital portrayals of physical products that can reach out to each phase of the… Read More »Digital Twin Technology: Reconciling the Physical and Digital The post Digital Twin Technology: Reconciling the Physical and Digital appeared first on Data Science Central.
    Deciphering the SQL Fix Recovery Model
    (5 min) Recovery models are database configurations that determine the nature of backup that a person performs and provide the chance of restoring data and recovering it from failure. The model decides the maintenance of the database’s transaction log as it reflects information modifications in a sequence. You may use it later on for database restore operations.… Read More »Deciphering the SQL Fix Recovery Model The post Deciphering the SQL Fix Recovery Model appeared first on Data Science Central.
  • Blog

    Setting Breakpoints and Exception Hooks in Python
    (14 min) There are different ways of debugging code in Python, one of which is to introduce breakpoints into the code at […] The post Setting Breakpoints and Exception Hooks in Python appeared first on Machine Learning Mastery.
  • MIT News - Artificial intelligence

    Computing for ocean environments
    (7 min) MIT ocean and mechanical engineers are using advances in scientific computing to address the ocean’s many challenges, and seize its opportunities.
    Seeing into the future: Personalized cancer screening with artificial intelligence
    (7 min) Scientists demonstrate that AI-risk models, paired with AI-designed screening policies, can offer significant and equitable improvements to cancer screening.
    Scientists make first detection of exotic “X” particles in quark-gluon plasma
    (6 min) The findings could redefine the kinds of particles that were abundant in the early universe.
  • AI Weirdness

    New breakfast cereals from AI
    (2 min) The thing about working with a giant language model like GPT-3 is it has read parts of the internet that it never occurred to me might exist. Like press releases from breakfast cereal companies, articles about press releases from breakfast cereal companies, blogs by breakfast cereal enthusiasts, and probably every
    Bonus: Get your breakfast quirkles!
    (1 min) AI Weirdness: the strange side of machine learning
  • Becoming Human: Artificial Intelligence Magazine - Medium

    Top 13 RPA Challenges and Ways to Overcome Them
    (10 min) Robotic process automation (RPA) projects are a challenging endeavor. According to a recent survey, 69% of RPA projects fail to take off…
  • John D. Cook

    Modal Logic and Science Fiction
    (2 min) Modal logic extends classical logic by adding one or more modes. If there’s only one mode, it’s usually denoted □. Curiously,  □ can have a wide variety of interpretations, and different interpretations motivate different axioms for how □ behaves. Modal logic is not one system but an infinite number of systems, depending on your choice […] Modal Logic and Science Fiction first appeared on John D. Cook.
  • Artificial Intelligence

    Microsoft AI Team Proposes DeepSpeed MoE Model: An End-to-End MoE Training and Inference Solution as Part of the DeepSpeed Library
    (1 min) The mixture of experts (MoE) is a promising deep learning model architecture that can minimize training cost complexity to several sublinear parameters. An ensemble learning technique is used in MoE architectures to break down modeling jobs into sub-tasks and train an expert model for each. A gating model learns which expert to trust and then combines the predictions based on the input to be forecasted. The availability and capability of hardware resources are pushed to the limit while training huge dense models. As a result, a good model architecture that could considerably reduce training costs compared to a quality-equivalent dense model is needed. Continue Reading Paper: https://arxiv.org/pdf/2201.05596.pdf Github: https://github.com/microsoft/DeepSpeed submitted by /u/techsucker [link] [comments]
    Can somebody provide me with a practical explanation of the mean average precision (mAP) evaluation metric?
    (1 min) I am working with object detection algorithms (I'm not a computer scientist), and have computed the precision, recall and therefore the mean average precision curve for about 450 test images. The resulting precision-recall curve looks something like this: ​ A drawing of the precision-recall curve that I generated for 450 test images. What is the practical application of the mAP? Is this saying that 77.6% of the time, the algorithm will correctly predict the right objects? submitted by /u/EnvironmentalLemon36 [link] [comments]
    The Key Process of Intelligence that AI is Still Missing
    (0 min) submitted by /u/FizzyP0p [link] [comments]
    Best self-paced course on conversational AI
    (0 min) Is there a good online (self-paced) course that covers the basic of conversational AI? I'd like to learn more about the fundamentals behind things like GPT-3, one-shot vs. multi-shot learning, etc. Thinking maybe some place like Coursera or Udemy might have something but haven't found anything yet. submitted by /u/LiveToSee22 [link] [comments]
    Show us AI that used in NFT 😊 thanks
    (1 min) May be making 3d nft collection May be advertising on discord telegram and so..... May be moving the Art Show us what you got so that we can help each other submitted by /u/anaszewail [link] [comments]
    Top 10 Explicit Content Detection APIs
    (0 min) submitted by /u/tah_zem [link] [comments]
    Meta releases Data2vec, a more generalized/multimodal learning framework.
    (0 min) submitted by /u/No-Transition-6630 [link] [comments]
    Top Emerging Computer Vision Trends For 2022 (with some of the recommended papers list to read)
    (0 min) submitted by /u/techsucker [link] [comments]
    Artificial Nightmares : Maleficent's Curse || Pytti VQGAN AI Art Video [4K 60 FPS]
    (0 min) submitted by /u/Thenamessd [link] [comments]
  • Machine Learning

    Salesforce AI Residency Program [R].
    (0 min) Hi all, This thread is for everyone who has applied to the Salesforce AI Residency program in 2022. The applications closed on January 3rd and the online coding test was also sent out. Has anyone heard about the interviews? Thank you! Wishing everyone the best of luck with their applications. :) submitted by /u/Away-Conference7879 [link] [comments]
    [D] First Author Interview - Dynamic Inference with Neural Interpreters (Video Walkthrough & Author Interview)
    (1 min) https://youtu.be/w3knicSHx5s This video includes an interview with the paper's authors! What if we treated deep networks like modular programs? Neural Interpreters divide computation into small modules and route data to them via a dynamic type inference system. The resulting model combines recurrent elements, weight sharing, attention, and more to tackle both abstract reasoning, as well as computer vision tasks. OUTLINE: 0:00 - Intro & Overview 3:00 - Model Overview 7:00 - Interpreter weights and function code 9:40 - Routing data to functions via neural type inference 14:55 - ModLin layers 18:25 - Experiments 21:35 - Interview Start 24:50 - General Model Structure 30:10 - Function code and signature 40:30 - Explaining Modulated Layers 49:50 - A closer look at weight sharing 58:30 - Experimental Results ​ Paper: https://arxiv.org/abs/2110.06399 ​ Guests: Nasim Rahaman: https://twitter.com/nasim_rahaman Francesco Locatello: https://twitter.com/FrancescoLocat8 Waleed Gondal: https://twitter.com/Wallii_gondal submitted by /u/ykilcher [link] [comments]
    [N] Easily load and upload Stable-baselines3 models from the Hugging Face Hub
    (1 min) Hey there , I'm Thomas Simonini from Hugging Face, I’m happy to announce that we just integrated Stable-Baselines3 to the Hugging Face Hub. You can now: Host your saved models Load powerful trained models from the community For instance, with these lines of codes I can load a trained agent playing Space Invaders: https://i.redd.it/1n97tppif2d81.gif If you want to start to use it, I wrote a tutorial https://huggingface.co/blog/sb3 I would love to hear your feedback about it, submitted by /u/cranthir_ [link] [comments]
    [Discussion] A chrome/edge extension to get ratings info on openreview site quickly.
    (0 min) A chrome/edge extension operates only on openreview sites, which allows you to get the ratings/decision info quickly, instead of SCROLLING all page. Openreview Quicklook ​ https://preview.redd.it/0p8ozmp1d2d81.png?width=2880&format=png&auto=webp&s=74240f54260206ed2fb2db98065c2e6350588450 It is available on Chrome and Edge. Source. submitted by /u/weiguoqiang [link] [comments]
    [P]benchmark datasets?
    (1 min) hello I’m working on a new ML algo that wants some quick performance comparisons with benchmarks. Please suggest some clean datasets (i.e. no need to impute) having a small number of features (e.g. 30?) with known results from other ML algos? Both classification and regression tasks are welcome. thanks :) submitted by /u/reddittomtom [link] [comments]
    [D] How often you fall into data leakage
    (2 min) I mostly work in EEG and it's often I fall into data leakge, but later I came to know that there is some data leakage. Some time I even reported the results. When I came to know myself that there is data leakage,I feel embarrassed. I want to know how often you faced this issue? And how you tackle it? I have given a presentation with good results around 70to80% accuracy. But now I came to know there is data leakage. For example I created a function which contains a keras model. In the beginning of function I wrote clear_session(). And for cross validation I defined it before loop. I should define it inside the loop. The thing making me confused is that as the folds increase the results should be improved but this is not the case. Here are the results Fold 1: 0.66 Fold 2: 0.87 Fold 3: 0.96 Fold 4: 0.82 Fold 5: 0.92 And how tell supervisor the results I showed earlier are useless submitted by /u/mrtac96 [link] [comments]
    [D] Do you think the industrial space (mainly manufacturing and utility companies in the EU) are ready for machine learning solutions?
    (2 min) If you want to dive into Industry 4.0./Digital transformation talks as well, by all means. I'm curious about your perspective and how you think it'll be possible for industrial companies to start on the ML journey. ​ Are they still using pen & paper, so they're too stuck in the old ways? Do they know what's possible for their operations? Are they reluctant to change (remove/implement) things? submitted by /u/zenithie [link] [comments]
    [R] Editing real videos with StyleGAN
    (1 min) ​ Hi all, The Author is here. Check the project page of our new paper "Stitch it in Time: GAN-Based Facial Editing of Real Videos". Using a standard pre-trained StyleGAN2, we can perform various edits over real videos: smile, older, younger, angry, etc. Our key insight is that a temporal model is not necessary. Instead, we maintain the temporal consistency of the original video by choosing the correct tools to invert and edit the video. Finally, we perform additional tuning to stitch the edited crop back to the video. Remarkably, it even works on out-of-domain cartoons. Don't miss the videos! our main contribution is the temporal smoothness. ​ Feel free to ask anything that comes to your mind ​ ​ https://preview.redd.it/mmm6q94s41d81.png?width=2878&format=png&auto=webp&s=d1d3459607d84a3477158152f44cdea54e3d87a5 https://preview.redd.it/fpoxc84s41d81.png?width=2830&format=png&auto=webp&s=5c67e60d2f56a54cd8ba68888cbd897fa2e06350 submitted by /u/RonMokady [link] [comments]
    [D] Tips for ML workflow on raw data
    (6 min) I am an ML researcher and work on real-world problems. Unlike core ML research where datasets are pretty much defined (in most cases), I usually deal with the raw data. We have live data (time-series) recorded and available at the servers. We have different objects and each object can have different sensors. Experiments usually compare different sensors among the same objects and the other way around. And it can be more than 100k records for one sensor/object. Currently, my workflow looks as follows: Pull the data from the server to my local machine Run the scripts to clean and "annotate" the data (+ analysis) Storing cleaned/annotated data as csv Forming/exploring the right design matrix (+ analysis) Exploring different models and training them Error analysis Since I deal with raw data, I sometimes end up spending a lot of time on steps 2 and 4. And then it becomes painful to keep track of code as well parameters of preprocessing. And the problem grows exponentially if I start looking into different models or start approaching a bit different nature of an experiment. How do you guys approach it? What does your workflow look like? Any tips would be really appreciated. Thank you. submitted by /u/muaz_usmani [link] [comments]
    [P] BERN2: an advanced neural biomedical named entity recognition and normalization tool
    (1 min) ​ https://preview.redd.it/qj3fd2o0f0d81.png?width=1262&format=png&auto=webp&s=bd69861c00db46395b36bc79682e95d9f65d67cc Abstract In biomedical natural language processing, named entity recognition (NER) and named entity normalization (NEN) are key tasks that enable the automatic extraction of biomedical entities (e.g., diseases and chemicals) from the ever-growing biomedical literature. In this paper, we present BERN2 (Advanced Biomedical Entity Recognition and Normalization), a tool that improves the previous neural network-based NER tool (Kim et al., 2019) by employing a multi-task NER model and neural network-based NEN models to achieve much faster and more accurate inference. We hope that our tool can help annotate large-scale biomedical texts more accurately for various tasks such as biomedical knowledge graph construction. *Web Service: http://bern2.korea.ac.kr *Github: https://github.com/dmis-lab/BERN2 *Paper: http://arxiv.org/abs/2201.02080 submitted by /u/bobae_papa [link] [comments]
    [P] Simple Tensorflow implementation of "Image Generators with Conditionally-Independent Pixel Synthesis (CVPR 2021, Oral)"
    (1 min) ​ teaser.png Abstract Existing image generator networks rely heavily on spatial convolutions and, optionally, self-attention blocks in order to gradually synthesize images in a coarse-to-fine manner. Here, we present a new architecture for image generators, where the color value at each pixel is computed independently given the value of a random latent vector and the coordinate of that pixel. No spatial convolutions or similar operations that propagate information across pixels are involved during the synthesis. We analyze the modeling capa- bilities of such generators when trained in an adversarial fashion, and observe the new generators to achieve similar generation quality to state-of-the-art convolutional generators. We also investigate several interesting properties unique to the new architecture. \ paper:* https://arxiv.org/abs/2011.13775 \ code:* https://github.com/taki0112/CIPS-Tensorflow submitted by /u/taki0112 [link] [comments]
    [D] Term for common input in a multi-branch network
    (1 min) I am trying to find terms that signify a CNN/deep model where there are multiple branches, but each branch gets the same input. In addition, this common input is processed with at least 1 CNN layer itself. An example (see image) of the overall architecture I am looking for would be a model that takes any ResNet base model, and splits it up by the bottleneck layers (1 through 4). Then, an image can be processed with the bottleneck layer 1. Subsequently, we have multiple branches, each branch being a copy of ResNet's layers [2,3,4] together. Assume three branches. So the output of ResNet bottleneck layer 1 is copied in triplicate, and passed to branches 1,2, and 3, each of which is ResNet bottleneck layers [2,3,4]. W.r.t. the image, the first one is a regular single branch resnet, and the second is the architecture I am looking for. What is this type of architecture called, if it has a name? Conversely, if you have any papers that use this architecture, I would appreciate those as well, so I could at least get a hang of any terminology authors use. Model mockup image submitted by /u/asuprem [link] [comments]
  • Neural Networks, Deep Learning and Machine Learning

    Dragon Ball - Intro PT HD
    (0 min) https://youtube.com/watch?v=9n83cAvqJk0&feature=share Intro to Dragon Ball PT, upscaling it using Topaz Video Enhance AI, version 2.3.0. Initial resolution - 640x480 Final resolution - 2560x1920 submitted by /u/3dforlife [link] [comments]
    How to pass multiple vectors to a RNN/LSTM neural network and get output as a vector. Can someone explain or give reference to code/text . It will be great help . Thanks
    (1 min) submitted by /u/aabra__ka__daabra [link] [comments]
  • Reinforcement Learning

    Can anyone help me with MuJoCo installation on Mac with M1 chip ?
    (1 min) It's okay if it has to be ‏‏‎ intricate, I just need to obtain it running on my Mac. submitted by /u/Icy_Egg9244 [link] [comments]
    Looking to hire: RL expert to build tennis playing robot
    (1 min) I am currently working for a client looking to expand their machine learning team with an RL expert. This is a greenfield project is to first simulate then build a robot to return tennis serves. Perks: - Technical freedom, pick the tools you know and develop overall solution architecture as you see fit - No bureaucracy, small team - Competative rate Time zone is western Europe, remote working. Drop me a message or comment if you are interested! submitted by /u/InterGalacticMedium [link] [comments]
    DeepStack/PoG DeepMinders leave for algorithmic trading startup (Martin Schmid/Matej Moravcik/Rudolf Kadlec)
    (1 min) submitted by /u/gwern [link] [comments]
    Easily load and upload Stable-baselines3 models from the Hugging Face Hub 🤗
    (1 min) Hey there 👋, I'm Thomas Simonini from Hugging Face 🤗, I’m happy to announce that we just integrated Stable-Baselines3 to the Hugging Face Hub. You can now: Host your saved models 💾 Load powerful trained models from the community 🔥 Both of them for free. For instance, with these lines of codes I can load a trained agent playing Space Invaders: https://i.redd.it/40zyf6m922d81.gif If you want to start to use it, I wrote a tutorial 👉 https://huggingface.co/blog/sb3 I would love to hear your feedback about it ❤️, At Hugging Face, we are contributing to the ecosystem for Deep Reinforcement Learning researchers and enthusiasts and in the coming weeks and months, we will be extending the ecosystem by: Integrating RL-baselines3-zoo Uploading RL-trained-agents models into the 🤗 Hub: a big collection of pre-trained reinforcement learning agents using stable-baselines3. Integrating other Deep Reinforcement Learning libraries Implementing Decision Transformers 🔥 And more to come 🥳 📢 The best way to keep in touch is to join our discord server to exchange with us and with the community. Thanks! submitted by /u/cranthir_ [link] [comments]
    Ray PPO Hyperparameter tuning help
    (1 min) Howdy! I am fairly new to the realm of reinforcement learning, and I need some advice for hyperparameter tuning based on the output of Ray's rllib graphs for model performance. I am getting some weird graphs, and I know they are not correct, as my model is not even coming close to converging, it keeps getting lost. Reward shaping is also in the question, which can be elaborated on if necessary. Please let me know if I am missing any graphs that would be useful. Using default Ray PPO config, with these overrides config['num_workers'] = 6 config['num_cpus_per_worker'] = 0.25 config['train_batch_size'] = 20 config['sgd_minibatch_size'] = 5 config['num_envs_per_worker'] = 3 config["entropy_coeff"] = 0.005 config["entropy_coeff_schedule"]= [[0,0.005],[ending,0]] config["grad_clip"] = 40 config["vf_loss_coeff"] = 20 config["model"]["vf_share_layers"] = True config["batch_mode"]: "complete_episodes" Learning rate of 5e-6 ​ Here is a link to the imgur gallery with graphs. Thanks in advance! Handtb submitted by /u/handtb [link] [comments]
    "Safe Deep RL in 3D Environments using Human Feedback", Rahtz et al 2022
    (1 min) submitted by /u/gwern [link] [comments]
    How can i know which actions have the agent in the enviroment in algorithms of Stable-baselines3?
    (1 min) I'm working with the library of Stable-baselines3 (https://github.com/DLR-RM/stable-baselines3) and i've tried with Soft Actor Critic(SAC)i started to use this packages and i have a question about the actions. I know the kind of space in SAC how explaind in (https://stable-baselines3.readthedocs.io/en/master/modules/sac.html) but i would like to know what kind of actions do the agent in the enviroment, specifically with the robotic enviroment "Fetch" in the task of pick and place does somebody have used this package and worked with robotics enviroments in mujoco? submitted by /u/ArtFL-Robotic-1121 [link] [comments]

2022-01-20

  • cs.LG updates on arXiv.org

    On the Expressivity of Markov Reward. (arXiv:2111.00876v2 [cs.LG] UPDATED)
    (2 min) Reward is the driving force for reinforcement-learning agents. This paper is dedicated to understanding the expressivity of reward as a way to capture tasks that we would want an agent to perform. We frame this study around three new abstract notions of "task" that might be desirable: (1) a set of acceptable behaviors, (2) a partial ordering over behaviors, or (3) a partial ordering over trajectories. Our main results prove that while reward can express many of these tasks, there exist instances of each task type that no Markov reward function can capture. We then provide a set of polynomial-time algorithms that construct a Markov reward function that allows an agent to optimize tasks of each of these three types, and correctly determine when no such reward function exists. We conclude with an empirical study that corroborates and illustrates our theoretical findings.
    On the Optimality of the Oja's Algorithm for Online PCA. (arXiv:2104.00512v2 [cs.LG] UPDATED)
    (2 min) In this paper we analyze the behavior of the Oja's algorithm for online/streaming principal component subspace estimation. It is proved that with high probability it performs an efficient, gap-free, global convergence rate to approximate an principal component subspace for any sub-Gaussian distribution. Moreover, it is the first time to show that the convergence rate, namely the upper bound of the approximation, exactly matches the lower bound of an approximation obtained by the offline/classical PCA up to a constant factor.
    Deep Unified Representation for Heterogeneous Recommendation. (arXiv:2201.05861v1 [cs.IR])
    (2 min) Recommendation system has been a widely studied task both in academia and industry. Previous works mainly focus on homogeneous recommendation and little progress has been made for heterogeneous recommender systems. However, heterogeneous recommendations, e.g., recommending different types of items including products, videos, celebrity shopping notes, among many others, are dominant nowadays. State-of-the-art methods are incapable of leveraging attributes from different types of items and thus suffer from data sparsity problems. And it is indeed quite challenging to represent items with different feature spaces jointly. To tackle this problem, we propose a kernel-based neural network, namely deep unified representation (or DURation) for heterogeneous recommendation, to jointly model unified representations of heterogeneous items while preserving their original feature space topology structures. Theoretically, we prove the representation ability of the proposed model. Besides, we conduct extensive experiments on real-world datasets. Experimental results demonstrate that with the unified representation, our model achieves remarkable improvement (e.g., 4.1% ~ 34.9% lift by AUC score and 3.7% lift by online CTR) over existing state-of-the-art models.
    Black-Box Testing of Deep Neural Networks through Test Case Diversity. (arXiv:2112.12591v2 [cs.SE] UPDATED)
    (2 min) Deep Neural Networks (DNNs) have been extensively used in many areas including image processing, medical diagnostics, and autonomous driving. However, DNNs can exhibit erroneous behaviours that may lead to critical errors, especially when used in safety-critical systems. Inspired by testing techniques for traditional software systems, researchers have proposed neuron coverage criteria, as an analogy to source code coverage, to guide the testing of DNN models. Despite very active research on DNN coverage, several recent studies have questioned the usefulness of such criteria in guiding DNN testing. Further, from a practical standpoint, these criteria are white-box as they require access to the internals or training data of DNN models, which is in many contexts not feasible or convenient. In this paper, we investigate black-box input diversity metrics as an alternative to white-box coverage criteria. To this end, we first select and adapt three diversity metrics and study, in a controlled manner, their capacity to measure actual diversity in input sets. We then analyse their statistical association with fault detection using two datasets and three DNN models. We further compare diversity with state-of-the-art white-box coverage criteria. Our experiments show that relying on the diversity of image features embedded in test input sets is a more reliable indicator than coverage criteria to effectively guide the testing of DNNs. Indeed, we found that one of our selected black-box diversity metrics far outperforms existing coverage criteria in terms of fault-revealing capability and computational time. Results also confirm the suspicions that state-of-the-art coverage metrics are not adequate to guide the construction of test input sets to detect as many faults as possible with natural inputs.
    Deep Learning strategies for ProtoDUNE raw data denoising. (arXiv:2103.01596v2 [hep-ph] UPDATED)
    (2 min) In this work, we investigate different machine learning-based strategies for denoising raw simulation data from the ProtoDUNE experiment. The ProtoDUNE detector is hosted by CERN and it aims to test and calibrate the technologies for DUNE, a forthcoming experiment in neutrino physics. The reconstruction workchain consists of converting digital detector signals into physical high-level quantities. We address the first step in reconstruction, namely raw data denoising, leveraging deep learning algorithms. We design two architectures based on graph neural networks, aiming to enhance the receptive field of basic convolutional neural networks. We benchmark this approach against traditional algorithms implemented by the DUNE collaboration. We test the capabilities of graph neural network hardware accelerator setups to speed up training and inference processes.
    Targeted VAE: Variational and Targeted Learning for Causal Inference. (arXiv:2009.13472v5 [stat.ML] UPDATED)
    (2 min) Undertaking causal inference with observational data is incredibly useful across a wide range of tasks including the development of medical treatments, advertisements and marketing, and policy making. There are two significant challenges associated with undertaking causal inference using observational data: treatment assignment heterogeneity (\textit{i.e.}, differences between the treated and untreated groups), and an absence of counterfactual data (\textit{i.e.}, not knowing what would have happened if an individual who did get treatment, were instead to have not been treated). We address these two challenges by combining structured inference and targeted learning. In terms of structure, we factorize the joint distribution into risk, confounding, instrumental, and miscellaneous factors, and in terms of targeted learning, we apply a regularizer derived from the influence curve in order to reduce residual bias. An ablation study is undertaken, and an evaluation on benchmark datasets demonstrates that TVAE has competitive and state of the art performance.
    Faster Rates of Private Stochastic Convex Optimization. (arXiv:2108.00331v3 [cs.LG] UPDATED)
    (2 min) In this paper, we revisit the problem of Differentially Private Stochastic Convex Optimization (DP-SCO) and provide excess population risks for some special classes of functions that are faster than the previous results of general convex and strongly convex functions. In the first part of the paper, we study the case where the population risk function satisfies the Tysbakov Noise Condition (TNC) with some parameter $\theta>1$. Specifically, we first show that under some mild assumptions on the loss functions, there is an algorithm whose output could achieve an upper bound of $\tilde{O}((\frac{1}{\sqrt{n}}+\frac{\sqrt{d\log \frac{1}{\delta}}}{n\epsilon})^\frac{\theta}{\theta-1})$ for $(\epsilon, \delta)$-DP when $\theta\geq 2$, here $n$ is the sample size and $d$ is the dimension of the space. Then we address the inefficiency issue, improve the upper bounds by $\text{Poly}(\log n)$ factors and extend to the case where $\theta\geq \bar{\theta}>1$ for some known $\bar{\theta}$. Next we show that the excess population risk of population functions satisfying TNC with parameter $\theta\geq 2$ is always lower bounded by $\Omega((\frac{d}{n\epsilon})^\frac{\theta}{\theta-1}) $ and $\Omega((\frac{\sqrt{d\log \frac{1}{\delta}}}{n\epsilon})^\frac{\theta}{\theta-1})$ for $\epsilon$-DP and $(\epsilon, \delta)$-DP, respectively. In the second part, we focus on a special case where the population risk function is strongly convex. Unlike the previous studies, here we assume the loss function is {\em non-negative} and {\em the optimal value of population risk is sufficiently small}. With these additional assumptions, we propose a new method whose output could achieve an upper bound of $O(\frac{d\log\frac{1}{\delta}}{n^2\epsilon^2}+\frac{1}{n^{\tau}})$ for any $\tau\geq 1$ in $(\epsilon,\delta)$-DP model if the sample size $n$ is sufficiently large.
    Multi-View representation learning in Multi-Task Scene. (arXiv:2201.05829v1 [cs.CV])
    (2 min) Over recent decades have witnessed considerable progress in whether multi-task learning or multi-view learning, but the situation that consider both learning scenes simultaneously has received not too much attention. How to utilize multiple views latent representation of each single task to improve each learning task performance is a challenge problem. Based on this, we proposed a novel semi-supervised algorithm, termed as Multi-Task Multi-View learning based on Common and Special Features (MTMVCSF). In general, multi-views are the different aspects of an object and every view includes the underlying common or special information of this object. As a consequence, we will mine multiple views jointly latent factor of each learning task which consists of each view special feature and the common feature of all views. By this way, the original multi-task multi-view data has degenerated into multi-task data, and exploring the correlations among multiple tasks enables to make an improvement on the performance of learning algorithm. Another obvious advantage of this approach is that we get latent representation of the set of unlabeled instances by the constraint of regression task with labeled instances. The performance of classification and semi-supervised clustering task in these latent representations perform obviously better than it in raw data. Furthermore, an anti-noise multi-task multi-view algorithm called AN-MTMVCSF is proposed, which has a strong adaptability to noise labels. The effectiveness of these algorithms is proved by a series of well-designed experiments on both real world and synthetic data.
    Spatio-Temporal meets Wavelet: Disentangled Traffic Flow Forecasting via Efficient Spectral Graph Attention Network. (arXiv:2112.02740v2 [cs.LG] UPDATED)
    (2 min) Traffic forecasting is crucial for public safety and resource optimization, yet is very challenging due to three aspects: i) current existing works mostly exploit intricate temporal patterns (e.g., the short-term thunderstorm and long-term daily trends) within a single method, which fail to accurately capture spatio-temporal dependencies under different schemas; ii) the under-exploration of the graph positional encoding limit the extraction of spatial information in the commonly used full graph attention network; iii) the quadratic complexity of the full graph attention introduces heavy computational needs. To achieve the effective traffic flow forecasting, we propose an efficient spectral graph attention network with disentangled traffic sequences. Specifically, the discrete wavelet transform is leveraged to obtain the low- and high-frequency components of traffic sequences, and a dual-channel encoder is elaborately designed to accurately capture the spatio-temporal dependencies under long- and short-term schemas of the low- and high-frequency components. Moreover, a novel wavelet-based graph positional encoding and a query sampling strategy are introduced in our spectral graph attention to effectively guide message passing and efficiently calculate the attention. Extensive experiments on four real-world datasets show the superiority of our model, i.e., the higher traffic forecasting precision with lower computational cost.
    Inspecting state of the art performance and NLP metrics in image-based medical report generation. (arXiv:2011.09257v3 [cs.CL] UPDATED)
    (2 min) Several deep learning architectures have been proposed over the last years to deal with the problem of generating a written report given an imaging exam as input. Most works evaluate the generated reports using standard Natural Language Processing (NLP) metrics (e.g. BLEU, ROUGE), reporting significant progress. In this article, we contrast this progress by comparing state of the art (SOTA) models against weak baselines. We show that simple and even naive approaches yield near SOTA performance on most traditional NLP metrics. We conclude that evaluation methods in this task should be further studied towards correctly measuring clinical accuracy, ideally involving physicians to contribute to this end.
    Improving Adversarial Robustness via Channel-wise Activation Suppressing. (arXiv:2103.08307v2 [cs.LG] UPDATED)
    (2 min) The study of adversarial examples and their activation has attracted significant attention for secure and robust learning with deep neural networks (DNNs). Different from existing works, in this paper, we highlight two new characteristics of adversarial examples from the channel-wise activation perspective: 1) the activation magnitudes of adversarial examples are higher than that of natural examples; and 2) the channels are activated more uniformly by adversarial examples than natural examples. We find that the state-of-the-art defense adversarial training has addressed the first issue of high activation magnitudes via training on adversarial examples, while the second issue of uniform activation remains. This motivates us to suppress redundant activation from being activated by adversarial perturbations via a Channel-wise Activation Suppressing (CAS) strategy. We show that CAS can train a model that inherently suppresses adversarial activation, and can be easily applied to existing defense methods to further improve their robustness. Our work provides a simple but generic training strategy for robustifying the intermediate layer activation of DNNs.
    SMACE: A New Method for the Interpretability of Composite Decision Systems. (arXiv:2111.08749v2 [cs.LG] UPDATED)
    (2 min) Interpretability is a pressing issue for decision systems. Many post hoc methods have been proposed to explain the predictions of a single machine learning model. However, business processes and decision systems are rarely centered around a unique model. These systems combine multiple models that produce key predictions, and then apply decision rules to generate the final decision. To explain such decisions, we propose the Semi-Model-Agnostic Contextual Explainer (SMACE), a new interpretability method that combines a geometric approach for decision rules with existing solutions for machine learning models to generate an intuitive feature ranking tailored to the end user. We show that established model-agnostic approaches produce poor results on tabular data in this setting, in particular giving the same importance to several features, whereas SMACE can rank them in a meaningful way.
    Unsupervised MRI Reconstruction via Zero-Shot Learned Adversarial Transformers. (arXiv:2105.08059v3 [eess.IV] UPDATED)
    (2 min) Supervised reconstruction models are characteristically trained on matched pairs of undersampled and fully-sampled data to capture an MRI prior, along with supervision regarding the imaging operator to enforce data consistency. To reduce supervision requirements, the recent deep image prior framework instead conjoins untrained MRI priors with the imaging operator during inference. Yet, canonical convolutional architectures are suboptimal in capturing long-range relationships, and priors based on randomly initialized networks may yield suboptimal performance. To address these limitations, here we introduce a novel unsupervised MRI reconstruction method based on zero-Shot Learned Adversarial TransformERs (SLATER). SLATER embodies a deep adversarial network with cross-attention transformers to map noise and latent variables onto coil-combined MR images. During pre-training, this unconditional network learns a high-quality MRI prior in an unsupervised generative modeling task. During inference, a zero-shot reconstruction is then performed by incorporating the imaging operator and optimizing the prior to maximize consistency to undersampled data. Comprehensive experiments on brain MRI datasets clearly demonstrate the superior performance of SLATER against state-of-the-art unsupervised methods.
    Deep Indexed Active Learning for Matching Heterogeneous Entity Representations. (arXiv:2104.03986v2 [cs.DB] UPDATED)
    (2 min) Given two large lists of records, the task in entity resolution (ER) is to find the pairs from the Cartesian product of the lists that correspond to the same real world entity. Typically, passive learning methods on such tasks require large amounts of labeled data to yield useful models. Active Learning is a promising approach for ER in low resource settings. However, the search space, to find informative samples for the user to label, grows quadratically for instance-pair tasks making active learning hard to scale. Previous works, in this setting, rely on hand-crafted predicates, pre-trained language model embeddings, or rule learning to prune away unlikely pairs from the Cartesian product. This blocking step can miss out on important regions in the product space leading to low recall. We propose DIAL, a scalable active learning approach that jointly learns embeddings to maximize recall for blocking and accuracy for matching blocked pairs. DIAL uses an Index-By-Committee framework, where each committee member learns representations based on powerful pre-trained transformer language models. We highlight surprising differences between the matcher and the blocker in the creation of the training data and the objective used to train their parameters. Experiments on five benchmark datasets and a multilingual record matching dataset show the effectiveness of our approach in terms of precision, recall and running time. Code is available at https://github.com/ArjitJ/DIAL
    Meta-repository of screening mammography classifiers. (arXiv:2108.04800v3 [cs.LG] UPDATED)
    (2 min) Artificial intelligence (AI) is showing promise in improving clinical diagnosis. In breast cancer screening, recent studies show that AI has the potential to improve early cancer diagnosis and reduce unnecessary workup. As the number of proposed models and their complexity grows, it is becoming increasingly difficult to re-implement them. To enable reproducibility of research and to enable comparison between different methods, we release a meta-repository containing models for classification of screening mammograms. This meta-repository creates a framework that enables the evaluation of AI models on any screening mammography data set. At its inception, our meta-repository contains five state-of-the-art models with open-source implementations and cross-platform compatibility. We compare their performance on seven international data sets. Our framework has a flexible design that can be generalized to other medical image analysis tasks. The meta-repository is available at https://www.github.com/nyukat/mammography_metarepository.
    Particle Dual Averaging: Optimization of Mean Field Neural Networks with Global Convergence Rate Analysis. (arXiv:2012.15477v3 [stat.ML] UPDATED)
    (2 min) We propose the particle dual averaging (PDA) method, which generalizes the dual averaging method in convex optimization to the optimization over probability distributions with quantitative runtime guarantee. The algorithm consists of an inner loop and outer loop: the inner loop utilizes the Langevin algorithm to approximately solve for a stationary distribution, which is then optimized in the outer loop. The method can thus be interpreted as an extension of the Langevin algorithm to naturally handle nonlinear functional on the probability space. An important application of the proposed method is the optimization of neural network in the mean field regime, which is theoretically attractive due to the presence of nonlinear feature learning, but quantitative convergence rate can be challenging to obtain. By adapting finite-dimensional convex optimization theory into the space of measures, we analyze PDA in regularized empirical / expected risk minimization, and establish quantitative global convergence in learning two-layer mean field neural networks under more general settings. Our theoretical results are supported by numerical simulations on neural networks with reasonable size.
    HiRID-ICU-Benchmark -- A Comprehensive Machine Learning Benchmark on High-resolution ICU Data. (arXiv:2111.08536v4 [cs.LG] UPDATED)
    (2 min) The recent success of machine learning methods applied to time series collected from Intensive Care Units (ICU) exposes the lack of standardized machine learning benchmarks for developing and comparing such methods. While raw datasets, such as MIMIC-IV or eICU, can be freely accessed on Physionet, the choice of tasks and pre-processing is often chosen ad-hoc for each publication, limiting comparability across publications. In this work, we aim to improve this situation by providing a benchmark covering a large spectrum of ICU-related tasks. Using the HiRID dataset, we define multiple clinically relevant tasks in collaboration with clinicians. In addition, we provide a reproducible end-to-end pipeline to construct both data and labels. Finally, we provide an in-depth analysis of current state-of-the-art sequence modeling methods, highlighting some limitations of deep learning approaches for this type of data. With this benchmark, we hope to give the research community the possibility of a fair comparison of their work.
    On Designing Good Representation Learning Models. (arXiv:2107.05948v3 [cs.LG] UPDATED)
    (2 min) The goal of representation learning is different from the ultimate objective of machine learning such as decision making, it is therefore very difficult to establish clear and direct objectives for training representation learning models. It has been argued that a good representation should disentangle the underlying variation factors, yet how to translate this into training objectives remains unknown. This paper presents an attempt to establish direct training criterions and design principles for developing good representation learning models. We propose that a good representation learning model should be maximally expressive, i.e., capable of distinguishing the maximum number of input configurations. We formally define expressiveness and introduce the maximum expressiveness (MEXS) theorem of a general learning model. We propose to train a model by maximizing its expressiveness while at the same time incorporating general priors such as model smoothness. We present a conscience competitive learning algorithm which encourages the model to reach its MEXS whilst at the same time adheres to model smoothness prior. We also introduce a label consistent training (LCT) technique to boost model smoothness by encouraging it to assign consistent labels to similar samples. We present extensive experimental results to show that our method can indeed design representation learning models capable of developing representations that are as good as or better than state of the art. We also show that our technique is computationally efficient, robust against different parameter settings and can work effectively on a variety of datasets. Code available at https://github.com/qlilx/odgrlm.git
    A Machine Learning and Computer Vision Approach to Rapidly Optimize Multiscale Droplet Generation. (arXiv:2105.13553v5 [cs.LG] UPDATED)
    (2 min) Generating droplets from a continuous stream of fluid requires precise tuning of a device to find optimized control parameter conditions. It is analytically intractable to compute the necessary control parameter values of a droplet-generating device that produces optimized droplets. Furthermore, as the length scale of the fluid flow changes, the formation physics and optimized conditions that induce flow decomposition into droplets also change. Hence, a single proportional integral derivative controller is too inflexible to optimize devices of different length scales or different control parameters, while classification machine learning techniques take days to train and require millions of droplet images. Therefore, the question is posed, can a single method be created that universally optimizes multiple length-scale droplets using only a few data points and is faster than previous approaches? In this paper, a Bayesian optimization and computer vision feedback loop is designed to quickly and reliably discover the control parameter values that generate optimized droplets within different length-scale devices. This method is demonstrated to converge on optimum parameter values using 60 images in only 2.3 hours, 30x faster than previous approaches. Model implementation is demonstrated for two different length-scale devices: a milliscale inkjet device and a microfluidics device.
    Untargeted Poisoning Attack Detection in Federated Learning via Behavior Attestation. (arXiv:2101.10904v3 [cs.CR] UPDATED)
    (2 min) Federated Learning (FL) is a paradigm in Machine Learning (ML) that addresses data privacy, security, access rights and access to heterogeneous information issues by training a global model using distributed nodes. Despite its advantages, there is an increased potential for cyberattacks on FL-based ML techniques that can undermine the benefits. Model-poisoning attacks on FL target the availability of the model. The adversarial objective is to disrupt the training. We propose attestedFL, a defense mechanism that monitors the training of individual nodes through state persistence in order to detect a malicious worker. A fine-grained assessment of the history of the worker permits the evaluation of its behavior in time and results in innovative detection strategies. We present three lines of defense that aim at assessing if the worker is reliable by observing if the node is really training, advancing towards a goal. Our defense exposes an attacker's malicious behavior and removes unreliable nodes from the aggregation process so that the FL process converge faster. Through extensive evaluations and against various adversarial settings, attestedFL increased the accuracy of the model between 12% to 58% under different scenarios such as attacks performed at different stages of convergence, attackers colluding and continuous attacks.
    Robot Learning from Randomized Simulations: A Review. (arXiv:2111.00956v2 [cs.RO] UPDATED)
    (2 min) The rise of deep learning has caused a paradigm shift in robotics research, favoring methods that require large amounts of data. Unfortunately, it is prohibitively expensive to generate such data sets on a physical platform. Therefore, state-of-the-art approaches learn in simulation where data generation is fast as well as inexpensive and subsequently transfer the knowledge to the real robot (sim-to-real). Despite becoming increasingly realistic, all simulators are by construction based on models, hence inevitably imperfect. This raises the question of how simulators can be modified to facilitate learning robot control policies and overcome the mismatch between simulation and reality, often called the 'reality gap'. We provide a comprehensive review of sim-to-real research for robotics, focusing on a technique named 'domain randomization' which is a method for learning from randomized simulations.
    GNMR: A provable one-line algorithm for low rank matrix recovery. (arXiv:2106.12933v2 [math.OC] UPDATED)
    (2 min) Low rank matrix recovery problems, including matrix completion and matrix sensing, appear in a broad range of applications. In this work we present GNMR -- an extremely simple iterative algorithm for low rank matrix recovery, based on a Gauss-Newton linearization. On the theoretical front, we derive recovery guarantees for GNMR in both the matrix sensing and matrix completion settings. Some of these results improve upon the best currently known for other methods. A key property of GNMR is that it implicitly keeps the factor matrices approximately balanced throughout its iterations. On the empirical front, we show that for matrix completion with uniform sampling, GNMR performs better than several popular methods, especially when given very few observations close to the information limit.
    Reinforcement Learning for Ridesharing: An Extended Survey. (arXiv:2105.01099v3 [cs.LG] UPDATED)
    (2 min) In this paper, we present a comprehensive, in-depth survey of the literature on reinforcement learning approaches to decision optimization problems in a typical ridesharing system. Papers on the topics of rideshare matching, vehicle repositioning, ride-pooling, routing, and dynamic pricing are covered. Popular data sets and open simulation environments are also introduced. Subsequently, we discuss a number of challenges and opportunities for reinforcement learning research on this important domain.
    How To Train Your Program: a Probabilistic Programming Pattern for Bayesian Learning From Data. (arXiv:2105.03650v2 [cs.LG] UPDATED)
    (2 min) We present a Bayesian approach to machine learning with probabilistic programs. In our approach, training on available data is implemented as inference on a hierarchical model. The posterior distribution of model parameters is then used to \textit{stochastically condition} a complementary model, such that inference on new data yields the same posterior distribution of latent parameters corresponding to the new data as inference on a hierachical model on the combination of both previously available and new data, at a lower computation cost. We frame the approach as a design pattern of probabilistic programming referred to herein as `stump and fungus', and evaluate realization of the pattern on case studies.
    Credit Assignment in Neural Networks through Deep Feedback Control. (arXiv:2106.07887v3 [cs.LG] UPDATED)
    (2 min) The success of deep learning sparked interest in whether the brain learns by using similar techniques for assigning credit to each synaptic weight for its contribution to the network output. However, the majority of current attempts at biologically-plausible learning methods are either non-local in time, require highly specific connectivity motives, or have no clear link to any known mathematical optimization method. Here, we introduce Deep Feedback Control (DFC), a new learning method that uses a feedback controller to drive a deep neural network to match a desired output target and whose control signal can be used for credit assignment. The resulting learning rule is fully local in space and time and approximates Gauss-Newton optimization for a wide range of feedback connectivity patterns. To further underline its biological plausibility, we relate DFC to a multi-compartment model of cortical pyramidal neurons with a local voltage-dependent synaptic plasticity rule, consistent with recent theories of dendritic processing. By combining dynamical system theory with mathematical optimization theory, we provide a strong theoretical foundation for DFC that we corroborate with detailed results on toy experiments and standard computer-vision benchmarks.
    Iterative Causal Discovery in the Possible Presence of Latent Confounders and Selection Bias. (arXiv:2111.04095v2 [cs.LG] UPDATED)
    (2 min) We present a sound and complete algorithm, called iterative causal discovery (ICD), for recovering causal graphs in the presence of latent confounders and selection bias. ICD relies on the causal Markov and faithfulness assumptions and recovers the equivalence class of the underlying causal graph. It starts with a complete graph, and consists of a single iterative stage that gradually refines this graph by identifying conditional independence (CI) between connected nodes. Independence and causal relations entailed after any iteration are correct, rendering ICD anytime. Essentially, we tie the size of the CI conditioning set to its distance on the graph from the tested nodes, and increase this value in the successive iteration. Thus, each iteration refines a graph that was recovered by previous iterations having smaller conditioning sets -- a higher statistical power -- which contributes to stability. We demonstrate empirically that ICD requires significantly fewer CI tests and learns more accurate causal graphs compared to FCI, FCI+, and RFCI algorithms (code is available at https://github.com/IntelLabs/causality-lab).
    On Centralized and Distributed Mirror Descent: Convergence Analysis Using Quadratic Constraints. (arXiv:2105.14385v2 [math.OC] UPDATED)
    (2 min) Mirror descent (MD) is a powerful first-order optimization technique that subsumes several optimization algorithms including gradient descent (GD). In this work, we develop a semi-definite programming (SDP) framework to analyze the convergence rate of MD in centralized and distributed settings under both strongly convex and non-strongly convex assumptions. We view MD with a dynamical system lens and leverage quadratic constraints (QCs) to provide explicit convergence rates based on Lyapunov stability. For centralized MD under strongly convex assumption, we develop a SDP that certifies exponential convergence rates. We prove that the SDP always has a feasible solution that recovers the optimal GD rate as a special case. We complement our analysis by providing the $O(1/k)$ convergence rate for convex problems. Next, we analyze the convergence of distributed MD and characterize the rate using SDP. To the best of our knowledge, the numerical rate of distributed MD has not been previously reported in the literature. We further prove an $O(1/k)$ convergence rate for distributed MD in the convex setting. Our numerical experiments on strongly convex problems indicate that our framework certifies superior convergence rates compared to the existing rates for distributed GD.
    How Modular Should Neural Module Networks Be for Systematic Generalization?. (arXiv:2106.08170v2 [cs.LG] UPDATED)
    (2 min) Neural Module Networks (NMNs) aim at Visual Question Answering (VQA) via composition of modules that tackle a sub-task. NMNs are a promising strategy to achieve systematic generalization, i.e., overcoming biasing factors in the training distribution. However, the aspects of NMNs that facilitate systematic generalization are not fully understood. In this paper, we demonstrate that the degree of modularity of the NMN have large influence on systematic generalization. In a series of experiments on three VQA datasets (VQA-MNIST, SQOOP, and CLEVR-CoGenT), our results reveal that tuning the degree of modularity, especially at the image encoder stage, reaches substantially higher systematic generalization. These findings lead to new NMN architectures that outperform previous ones in terms of systematic generalization.
    Bandit Data-Driven Optimization. (arXiv:2008.11707v2 [cs.LG] UPDATED)
    (2 min) Applications of machine learning in the non-profit and public sectors often feature an iterative workflow of data acquisition, prediction, and optimization of interventions. There are four major pain points that a machine learning pipeline must overcome in order to be actually useful in these settings: small data, data collected only under the default intervention, unmodeled objectives due to communication gap, and unforeseen consequences of the intervention. In this paper, we introduce bandit data-driven optimization, the first iterative prediction-prescription framework to address these pain points. Bandit data-driven optimization combines the advantages of online bandit learning and offline predictive analytics in an integrated framework. We propose PROOF, a novel algorithm for this framework and formally prove that it has no-regret. Using numerical simulations, we show that PROOF achieves superior performance than existing baseline. We also apply PROOF in a detailed case study of food rescue volunteer recommendation, and show that PROOF as a framework works well with the intricacies of ML models in real-world AI for non-profit and public sector applications.
    MutFormer: A context-dependent transformer-based model to predict pathogenic missense mutations. (arXiv:2110.14746v2 [q-bio.GN] UPDATED)
    (0 min) A missense mutation is a point mutation that results in a substitution of an amino acid in a protein sequence. Currently, missense mutations account for over half of the known variants responsible for Mendelian diseases, but accurate prediction of the pathogenicity of missense variants remains challenging. Recent advances show transformer models - a type of deep neural network - to be particularly powerful at modeling sequence information with context dependence, such as language translation tasks. In this study, we introduce MutFormer, a transformer-based model for the prediction of pathogenic missense mutations, using reference and mutated amino acid sequences from the human genome as the features. We first pre-trained MutFormer on reference protein sequences and mutated protein sequences resulting from common genetic variants (assumed to be benign). We next tested different fine-tuning methods for pathogenicity prediction, and determined that utilizing paired-protein sequences (concatenation of the wildtype and mutated protein sequence) yields the most optimal results. Our results show that MutFormer outperforms a variety of existing tools in pathogenicity prediction, including those using conventional machine-learning approaches (e.g., support vector machine, convolutional neural network, recurrent neural network, etc.), and may complement existing computational predictions and empirically generated functional scores to improve our understanding of disease variants. MutFormer and pre-computed variant scores are publicly available on GitHub at https://github.com/WGLab/mutformer.
    Disentangling Transfer and Interference in Multi-Domain Learning. (arXiv:2107.05445v4 [cs.CV] UPDATED)
    (2 min) Humans are incredibly good at transferring knowledge from one domain to another, enabling rapid learning of new tasks. Likewise, transfer learning has enabled enormous success in many computer vision problems using pretraining. However, the benefits of transfer in multi-domain learning, where a network learns multiple tasks defined by different datasets, has not been adequately studied. Learning multiple domains could be beneficial, or these domains could interfere with each other given limited network capacity. Understanding how deep neural networks of varied capacity facilitate transfer across inputs from different distributions is a critical step towards open world learning. In this work, we decipher the conditions where interference and knowledge transfer occur in multi-domain learning. We propose new metrics disentangling interference and transfer, set up experimental protocols, and examine the roles of network capacity, task grouping, and dynamic loss weighting in reducing interference and facilitating transfer.
    FlexMatch: Boosting Semi-Supervised Learning with Curriculum Pseudo Labeling. (arXiv:2110.08263v2 [cs.LG] UPDATED)
    (2 min) The recently proposed FixMatch achieved state-of-the-art results on most semi-supervised learning (SSL) benchmarks. However, like other modern SSL algorithms, FixMatch uses a pre-defined constant threshold for all classes to select unlabeled data that contribute to the training, thus failing to consider different learning status and learning difficulties of different classes. To address this issue, we propose Curriculum Pseudo Labeling (CPL), a curriculum learning approach to leverage unlabeled data according to the model's learning status. The core of CPL is to flexibly adjust thresholds for different classes at each time step to let pass informative unlabeled data and their pseudo labels. CPL does not introduce additional parameters or computations (forward or backward propagation). We apply CPL to FixMatch and call our improved algorithm FlexMatch. FlexMatch achieves state-of-the-art performance on a variety of SSL benchmarks, with especially strong performances when the labeled data are extremely limited or when the task is challenging. For example, FlexMatch achieves 13.96% and 18.96% error rate reduction over FixMatch on CIFAR-100 and STL-10 datasets respectively, when there are only 4 labels per class. CPL also significantly boosts the convergence speed, e.g., FlexMatch can use only 1/5 training time of FixMatch to achieve even better performance. Furthermore, we show that CPL can be easily adapted to other SSL algorithms and remarkably improve their performances. We open-source our code at https://github.com/TorchSSL/TorchSSL.
    Provably Efficient Multi-Task Reinforcement Learning with Model Transfer. (arXiv:2107.08622v2 [cs.LG] UPDATED)
    (2 min) We study multi-task reinforcement learning (RL) in tabular episodic Markov decision processes (MDPs). We formulate a heterogeneous multi-player RL problem, in which a group of players concurrently face similar but not necessarily identical MDPs, with a goal of improving their collective performance through inter-player information sharing. We design and analyze an algorithm based on the idea of model transfer, and provide gap-dependent and gap-independent upper and lower bounds that characterize the intrinsic complexity of the problem.
    Learning Optimal Predictive Checklists. (arXiv:2112.01020v2 [cs.LG] UPDATED)
    (2 min) Checklists are simple decision aids that are often used to promote safety and reliability in clinical applications. In this paper, we present a method to learn checklists for clinical decision support. We represent predictive checklists as discrete linear classifiers with binary features and unit weights. We then learn globally optimal predictive checklists from data by solving an integer programming problem. Our method allows users to customize checklists to obey complex constraints, including constraints to enforce group fairness and to binarize real-valued features at training time. In addition, it pairs models with an optimality gap that can inform model development and determine the feasibility of learning sufficiently accurate checklists on a given dataset. We pair our method with specialized techniques that speed up its ability to train a predictive checklist that performs well and has a small optimality gap. We benchmark the performance of our method on seven clinical classification problems, and demonstrate its practical benefits by training a short-form checklist for PTSD screening. Our results show that our method can fit simple predictive checklists that perform well and that can easily be customized to obey a rich class of custom constraints.
    Long-Short Ensemble Network for Bipolar Manic-Euthymic State Recognition Based on Wrist-worn Sensors. (arXiv:2107.00710v2 [cs.LG] UPDATED)
    (2 min) Manic episodes of bipolar disorder can lead to uncritical behaviour and delusional psychosis, often with destructive consequences for those affected and their surroundings. Early detection and intervention of a manic episode are crucial to prevent escalation, hospital admission and premature death. However, people with bipolar disorder may not recognize that they are experiencing a manic episode and symptoms such as euphoria and increased productivity can also deter affected individuals from seeking help. This work proposes to perform user-independent, automatic mood-state detection based on actigraphy and electrodermal activity acquired from a wrist-worn device during mania and after recovery (euthymia). This paper proposes a new deep learning-based ensemble method leveraging long (20h) and short (5 minutes) time-intervals to discriminate between the mood-states. When tested on 47 bipolar patients, the proposed classification scheme achieves an average accuracy of 91.59% in euthymic/manic mood-state recognition.
    Perfect density models cannot guarantee anomaly detection. (arXiv:2012.03808v3 [cs.LG] UPDATED)
    (2 min) Thanks to the tractability of their likelihood, several deep generative models show promise for seemingly straightforward but important applications like anomaly detection, uncertainty estimation, and active learning. However, the likelihood values empirically attributed to anomalies conflict with the expectations these proposed applications suggest. In this paper, we take a closer look at the behavior of distribution densities through the lens of reparametrization and show that these quantities carry less meaningful information than previously thought, beyond estimation issues or the curse of dimensionality. We conclude that the use of these likelihoods for anomaly detection relies on strong and implicit hypotheses, and highlight the necessity of explicitly formulating these assumptions for reliable anomaly detection.
    Neural Latents Benchmark '21: Evaluating latent variable models of neural population activity. (arXiv:2109.04463v4 [cs.LG] UPDATED)
    (2 min) Advances in neural recording present increasing opportunities to study neural activity in unprecedented detail. Latent variable models (LVMs) are promising tools for analyzing this rich activity across diverse neural systems and behaviors, as LVMs do not depend on known relationships between the activity and external experimental variables. However, progress with LVMs for neuronal population activity is currently impeded by a lack of standardization, resulting in methods being developed and compared in an ad hoc manner. To coordinate these modeling efforts, we introduce a benchmark suite for latent variable modeling of neural population activity. We curate four datasets of neural spiking activity from cognitive, sensory, and motor areas to promote models that apply to the wide variety of activity seen across these areas. We identify unsupervised evaluation as a common framework for evaluating models across datasets, and apply several baselines that demonstrate benchmark diversity. We release this benchmark through EvalAI. this http URL
    Fractional SDE-Net: Generation of Time Series Data with Long-term Memory. (arXiv:2201.05974v1 [cs.LG])
    (2 min) In this paper, we focus on generation of time-series data using neural networks. It is often the case that input time-series data, especially taken from real financial markets, is irregularly sampled, and its noise structure is more complicated than i.i.d. type. To generate time series with such a property, we propose fSDE-Net: neural fractional Stochastic Differential Equation Network. It generalizes the neural SDE model by using fractional Brownian motion with Hurst index larger than half, which exhibits long-term memory property. We derive the solver of fSDE-Net and theoretically analyze the existence and uniqueness of the solution to fSDE-Net. Our experiments demonstrate that the fSDE-Net model can replicate distributional properties well.
    A new Sinkhorn algorithm with Deletion and Insertion operations. (arXiv:2111.14565v2 [cs.LG] UPDATED)
    (0 min) This technical report is devoted to the continuous estimation of an epsilon-assignment. Roughly speaking, an epsilon assignment between two sets V1 and V2 may be understood as a bijective mapping between a sub part of V1 and a sub part of V2 . The remaining elements of V1 (not included in this mapping) are mapped onto an epsilon pseudo element of V2 . We say that such elements are deleted. Conversely, the remaining elements of V2 correspond to the image of the epsilon pseudo element of V1. We say that these elements are inserted. As a result our method provides a result similar to the one of the Sinkhorn algorithm with the additional ability to reject some elements which are either inserted or deleted. It thus naturally handles sets V1 and V2 of different sizes and decides mappings/insertions/deletions in a unified way. Our algorithms are iterative and differentiable and may thus be easily inserted within a backpropagation based learning framework such as artificial neural networks.
    Proxy-Normalizing Activations to Match Batch Normalization while Removing Batch Dependence. (arXiv:2106.03743v5 [cs.LG] UPDATED)
    (2 min) We investigate the reasons for the performance degradation incurred with batch-independent normalization. We find that the prototypical techniques of layer normalization and instance normalization both induce the appearance of failure modes in the neural network's pre-activations: (i) layer normalization induces a collapse towards channel-wise constant functions; (ii) instance normalization induces a lack of variability in instance statistics, symptomatic of an alteration of the expressivity. To alleviate failure mode (i) without aggravating failure mode (ii), we introduce the technique "Proxy Normalization" that normalizes post-activations using a proxy distribution. When combined with layer normalization or group normalization, this batch-independent normalization emulates batch normalization's behavior and consistently matches or exceeds its performance.
    Machine Learning for Mechanical Ventilation Control. (arXiv:2102.06779v4 [cs.LG] UPDATED)
    (0 min) We consider the problem of controlling an invasive mechanical ventilator for pressure-controlled ventilation: a controller must let air in and out of a sedated patient's lungs according to a trajectory of airway pressures specified by a clinician. Hand-tuned PID controllers and similar variants have comprised the industry standard for decades, yet can behave poorly by over- or under-shooting their target or oscillating rapidly. We consider a data-driven machine learning approach: First, we train a simulator based on data we collect from an artificial lung. Then, we train deep neural network controllers on these simulators.We show that our controllers are able to track target pressure waveforms significantly better than PID controllers. We further show that a learned controller generalizes across lungs with varying characteristics much more readily than PID controllers do.
    Score-based Generative Neural Networks for Large-Scale Optimal Transport. (arXiv:2110.03237v3 [cs.LG] UPDATED)
    (0 min) We consider the fundamental problem of sampling the optimal transport coupling between given source and target distributions. In certain cases, the optimal transport plan takes the form of a one-to-one mapping from the source support to the target support, but learning or even approximating such a map is computationally challenging for large and high-dimensional datasets due to the high cost of linear programming routines and an intrinsic curse of dimensionality. We study instead the Sinkhorn problem, a regularized form of optimal transport whose solutions are couplings between the source and the target distribution. We introduce a novel framework for learning the Sinkhorn coupling between two distributions in the form of a score-based generative model. Conditioned on source data, our procedure iterates Langevin Dynamics to sample target data according to the regularized optimal coupling. Key to this approach is a neural network parametrization of the Sinkhorn problem, and we prove convergence of gradient descent with respect to network parameters in this formulation. We demonstrate its empirical success on a variety of large scale optimal transport tasks.
    Hierarchical VAEs Know What They Don't Know. (arXiv:2102.08248v7 [cs.LG] UPDATED)
    (0 min) Deep generative models have been demonstrated as state-of-the-art density estimators. Yet, recent work has found that they often assign a higher likelihood to data from outside the training distribution. This seemingly paradoxical behavior has caused concerns over the quality of the attained density estimates. In the context of hierarchical variational autoencoders, we provide evidence to explain this behavior by out-of-distribution data having in-distribution low-level features. We argue that this is both expected and desirable behavior. With this insight in hand, we develop a fast, scalable and fully unsupervised likelihood-ratio score for OOD detection that requires data to be in-distribution across all feature-levels. We benchmark the method on a vast set of data and model combinations and achieve state-of-the-art results on out-of-distribution detection.
    ViTBIS: Vision Transformer for Biomedical Image Segmentation. (arXiv:2201.05920v1 [eess.IV])
    (0 min) In this paper, we propose a novel network named Vision Transformer for Biomedical Image Segmentation (ViTBIS). Our network splits the input feature maps into three parts with $1\times 1$, $3\times 3$ and $5\times 5$ convolutions in both encoder and decoder. Concat operator is used to merge the features before being fed to three consecutive transformer blocks with attention mechanism embedded inside it. Skip connections are used to connect encoder and decoder transformer blocks. Similarly, transformer blocks and multi scale architecture is used in decoder before being linearly projected to produce the output segmentation map. We test the performance of our network using Synapse multi-organ segmentation dataset, Automated cardiac diagnosis challenge dataset, Brain tumour MRI segmentation dataset and Spleen CT segmentation dataset. Without bells and whistles, our network outperforms most of the previous state of the art CNN and transformer based models using Dice score and the Hausdorff distance as the evaluation metrics.
    BERTMap: A BERT-based Ontology Alignment System. (arXiv:2112.02682v3 [cs.AI] UPDATED)
    (0 min) Ontology alignment (a.k.a ontology matching (OM)) plays a critical role in knowledge integration. Owing to the success of machine learning in many domains, it has been applied in OM. However, the existing methods, which often adopt ad-hoc feature engineering or non-contextual word embeddings, have not yet outperformed rule-based systems especially in an unsupervised setting. In this paper, we propose a novel OM system named BERTMap which can support both unsupervised and semi-supervised settings. It first predicts mappings using a classifier based on fine-tuning the contextual embedding model BERT on text semantics corpora extracted from ontologies, and then refines the mappings through extension and repair by utilizing the ontology structure and logic. Our evaluation with three alignment tasks on biomedical ontologies demonstrates that BERTMap can often perform better than the leading OM systems LogMap and AML.
    Deep learning via LSTM models for COVID-19 infection forecasting in India. (arXiv:2101.11881v2 [cs.LG] UPDATED)
    (0 min) The COVID-19 pandemic continues to have major impact to health and medical infrastructure, economy, and agriculture. Prominent computational and mathematical models have been unreliable due to the complexity of the spread of infections. Moreover, lack of data collection and reporting makes modelling attempts difficult and unreliable. Hence, we need to re-look at the situation with reliable data sources and innovative forecasting models. Deep learning models such as recurrent neural networks are well suited for modelling spatiotemporal sequences. In this paper, we apply recurrent neural networks such as long short term memory (LSTM), bidirectional LSTM, and encoder-decoder LSTM models for multi-step (short-term) COVID-19 infection forecasting. We select Indian states with COVID-19 hotpots and capture the first (2020) and second (2021) wave of infections and provide two months ahead forecast. Our model predicts that the likelihood of another wave of infections in October and November 2021 is low; however, the authorities need to be vigilant given emerging variants of the virus. The accuracy of the predictions motivate the application of the method in other countries and regions. Nevertheless, the challenges in modelling remain due to the reliability of data and difficulties in capturing factors such as population density, logistics, and social aspects such as culture and lifestyle.
    Towards Communication-Efficient and Privacy-Preserving Federated Representation Learning. (arXiv:2109.14611v2 [cs.LG] UPDATED)
    (2 min) This paper investigates the feasibility of federated representation learning under the constraints of communication cost and privacy protection. Existing works either conduct annotation-guided local training which requires frequent communication or aggregates the client models via weight averaging which has potential risks of privacy exposure. To tackle the above problems, we first identify that self-supervised contrastive local training is robust against the non-identically distributed data, which provides the feasibility of longer local training and thus reduces the communication cost. Then based on the aforementioned robustness, we propose a novel Federated representation Learning framework with Ensemble Similarity Distillation~(FLESD) that utilizes this robustness. At each round of communication, the server first gathers a fraction of the clients' inferred similarity matrices on a public dataset. Then it ensembles the similarity matrices and train the global model via similarity distillation. We verify the effectiveness of FLESD by a series of empirical experiments and show that, despite stricter constraints, it achieves comparable results under multiple settings on multiple datasets.
    Training Spiking Neural Networks Using Lessons From Deep Learning. (arXiv:2109.12894v4 [cs.NE] UPDATED)
    (2 min) The brain is the perfect place to look for inspiration to develop more efficient neural networks. The inner workings of our synapses and neurons provide a glimpse at what the future of deep learning might look like. This paper serves as a tutorial and perspective showing how to apply the lessons learnt from several decades of research in deep learning, gradient descent, backpropagation and neuroscience to biologically plausible spiking neural neural networks. We also explore the delicate interplay between encoding data as spikes and the learning process; the challenges and solutions of applying gradient-based learning to spiking neural networks; the subtle link between temporal backpropagation and spike timing dependent plasticity, and how deep learning might move towards biologically plausible online learning. Some ideas are well accepted and commonly used amongst the neuromorphic engineering community, while others are presented or justified for the first time here. A series of companion interactive tutorials complementary to this paper using our Python package, snnTorch, are also made available: https://snntorch.readthedocs.io/en/latest/tutorials/index.html
    Robust Aggregation for Federated Learning. (arXiv:1912.13445v2 [stat.ML] UPDATED)
    (2 min) Federated learning is the centralized training of statistical models from decentralized data on mobile devices while preserving the privacy of each device. We present a robust aggregation approach to make federated learning robust to settings when a fraction of the devices may be sending corrupted updates to the server. The approach relies on a robust aggregation oracle based on the geometric median, which returns a robust aggregate using a constant number of iterations of a regular non-robust averaging oracle. The robust aggregation oracle is privacy-preserving, similar to the non-robust secure average oracle it builds upon. We establish its convergence for least squares estimation of additive models. We provide experimental results with linear models and deep networks for three tasks in computer vision and natural language processing. The robust aggregation approach is agnostic to the level of corruption; it outperforms the classical aggregation approach in terms of robustness when the level of corruption is high, while being competitive in the regime of low corruption. Two variants, a faster one with one-step robust aggregation and another one with on-device personalization, round off the paper.
    Labeling Trick: A Theory of Using Graph Neural Networks for Multi-Node Representation Learning. (arXiv:2010.16103v5 [cs.LG] UPDATED)
    (3 min) In this paper, we provide a theory of using graph neural networks (GNNs) for multi-node representation learning (where we are interested in learning a representation for a set of more than one node, such as link). We know that GNN is designed to learn single-node representations. When we want to learn a node set representation involving multiple nodes, a common practice in previous works is to directly aggregate the single-node representations obtained by a GNN into a joint node set representation. In this paper, we show a fundamental constraint of such an approach, namely the inability to capture the dependence between nodes in the node set, and argue that directly aggregating individual node representations does not lead to an effective joint representation for multiple nodes. Then, we notice that a few previous successful works for multi-node representation learning, including SEAL, Distance Encoding, and ID-GNN, all used node labeling. These methods first label nodes in the graph according to their relationships with the target node set before applying a GNN. Then, the node representations obtained in the labeled graph are aggregated into a node set representation. By investigating their inner mechanisms, we unify these node labeling techniques into a single and most general form -- labeling trick. We prove that with labeling trick a sufficiently expressive GNN learns the most expressive node set representations, thus in principle solves any joint learning tasks over node sets. Experiments on one important two-node representation learning task, link prediction, verified our theory. Our work explains the superior performance of previous node-labeling-based methods, and establishes a theoretical foundation of using GNNs for multi-node representation learning.
    Algorithmic encoding of protected characteristics in image-based models for disease detection. (arXiv:2110.14755v3 [cs.LG] UPDATED)
    (2 min) It has been rightfully emphasized that the use of AI for clinical decision making could amplify health disparities. A machine learning model may pick up undesirable correlations, for example, between a patient's racial identity and clinical outcome. Such correlations are often present in (historical) data used for model development. There has been an increase in studies reporting biases in image-based disease detection models. Besides the scarcity of data from underserved populations, very little is known about how these biases are encoded and how one may reduce or even remove disparate performance. There are concerns that an algorithm may recognize patient characteristics such as biological sex or racial identity, and then directly or indirectly use this information when making predictions. But it remains unclear how we can establish whether such information is actually used. This article aims to shed some light on these issues by exploring different methodology for assessing the inner working of disease detection models. We explore multitask learning and model inspection to assess the relationship between protected characteristics and prediction of disease. We believe our analysis framework could provide valuable insights in future studies in medical imaging AI. Our findings also call for further research to better understand the underlying causes of performance disparities.
    Regularising Inverse Problems with Generative Machine Learning Models. (arXiv:2107.11191v2 [eess.IV] UPDATED)
    (2 min) Deep neural network approaches to inverse imaging problems have produced impressive results in the last few years. In this paper, we consider the use of generative models in a variational regularisation approach to inverse problems. The considered regularisers penalise images that are far from the range of a generative model that has learned to produce images similar to a training dataset. We name this family \textit{generative regularisers}. The success of generative regularisers depends on the quality of the generative model and so we propose a set of desired criteria to assess models and guide future research. In our numerical experiments, we evaluate three common generative models, autoencoders, variational autoencoders and generative adversarial networks, against our desired criteria. We also test three different generative regularisers on the inverse problems of deblurring, deconvolution, and tomography. We show that the success of solutions restricted to lie exactly in the range of the generator is highly dependent on the ability of the generative model but that allowing small deviations from the range of the generator produces more consistent results.
    A Sparse Model-inspired Deep Thresholding Network for Exponential Signal Reconstruction -- Application in Fast Biological Spectroscopy. (arXiv:2012.14830v5 [cs.LG] UPDATED)
    (2 min) The non-uniform sampling is a powerful approach to enable fast acquisition but requires sophisticated reconstruction algorithms. Faithful reconstruction from partial sampled exponentials is highly expected in general signal processing and many applications. Deep learning has shown astonishing potential in this field but many existing problems, such as lack of robustness and explainability, greatly limit its applications. In this work, by combining merits of the sparse model-based optimization method and data-driven deep learning, we propose a deep learning architecture for spectra reconstruction from undersampled data, called MoDern. It follows the iterative reconstruction in solving a sparse model to build the neural network and we elaborately design a learnable soft-thresholding to adaptively eliminate the spectrum artifacts introduced by undersampling. Extensive results on both synthetic and biological data show that MoDern enables more robust, high-fidelity, and ultra-fast reconstruction than the state-of-the-art methods. Remarkably, MoDern has a small number of network parameters and is trained on solely synthetic data while generalizing well to biological data in various scenarios. Furthermore, we extend it to an open-access and easy-to-use cloud computing platform (XCloud-MoDern), contributing a promising strategy for further development of biological applications.
    A three layer neural network can represent any multivariate function. (arXiv:2012.03016v2 [cs.LG] UPDATED)
    (2 min) In 1987, Hecht-Nielsen showed that any continuous multivariate function can be implemented by a certain type three-layer neural network. This result was very much discussed in neural network literature. In this paper we prove that not only continuous functions but also all discontinuous functions can be implemented by such neural networks.
    On the Potential of Execution Traces for Batch Processing Workload Optimization in Public Clouds. (arXiv:2111.08759v2 [cs.DC] UPDATED)
    (2 min) With the growing amount of data, data processing workloads and the management of their resource usage becomes increasingly important. Since managing a dedicated infrastructure is in many situations infeasible or uneconomical, users progressively execute their respective workloads in the cloud. As the configuration of workloads and resources is often challenging, various methods have been proposed that either quickly profile towards a good configuration or determine one based on data from previous runs. Still, performance data to train such methods is often lacking and must be costly collected. In this paper, we propose a collaborative approach for sharing anonymized workload execution traces among users, mining them for general patterns, and exploiting clusters of historical workloads for future optimizations. We evaluate our prototype implementation for mining workload execution graphs on a publicly available trace dataset and demonstrate the predictive value of workload clusters determined through traces only.
    Reward Machines: Exploiting Reward Function Structure in Reinforcement Learning. (arXiv:2010.03950v2 [cs.LG] UPDATED)
    (2 min) Reinforcement learning (RL) methods usually treat reward functions as black boxes. As such, these methods must extensively interact with the environment in order to discover rewards and optimal policies. In most RL applications, however, users have to program the reward function and, hence, there is the opportunity to make the reward function visible -- to show the reward function's code to the RL agent so it can exploit the function's internal structure to learn optimal policies in a more sample efficient manner. In this paper, we show how to accomplish this idea in two steps. First, we propose reward machines, a type of finite state machine that supports the specification of reward functions while exposing reward function structure. We then describe different methodologies to exploit this structure to support learning, including automated reward shaping, task decomposition, and counterfactual reasoning with off-policy learning. Experiments on tabular and continuous domains, across different tasks and RL agents, show the benefits of exploiting reward structure with respect to sample efficiency and the quality of resultant policies. Finally, by virtue of being a form of finite state machine, reward machines have the expressive power of a regular language and as such support loops, sequences and conditionals, as well as the expression of temporally extended properties typical of linear temporal logic and non-Markovian reward specification.
    Cooperative Multi-Agent Deep Reinforcement Learning for Reliable Surveillance via Autonomous Multi-UAV Control. (arXiv:2201.05843v1 [eess.SY])
    (2 min) CCTV-based surveillance using unmanned aerial vehicles (UAVs) is considered a key technology for security in smart city environments. This paper creates a case where the UAVs with CCTV-cameras fly over the city area for flexible and reliable surveillance services. UAVs should be deployed to cover a large area while minimize overlapping and shadow areas for a reliable surveillance system. However, the operation of UAVs is subject to high uncertainty, necessitating autonomous recovery systems. This work develops a multi-agent deep reinforcement learning-based management scheme for reliable industry surveillance in smart city applications. The core idea this paper employs is autonomously replenishing the UAV's deficient network requirements with communications. Via intensive simulations, our proposed algorithm outperforms the state-of-the-art algorithms in terms of surveillance coverage, user support capability, and computational costs.
    Dynamic Differential-Privacy Preserving SGD. (arXiv:2111.00173v3 [cs.LG] UPDATED)
    (2 min) The vanilla Differentially-Private Stochastic Gradient Descent (DP-SGD), including DP-Adam and other variants, ensures the privacy of training data by uniformly distributing privacy costs across training steps. The equivalent privacy costs controlled by maintaining the same gradient clipping thresholds and noise powers in each step result in unstable updates and a lower model accuracy when compared to the non-DP counterpart. In this paper, we propose the dynamic DP-SGD (along with dynamic DP-Adam, and others) to reduce the performance loss gap while maintaining privacy by dynamically adjusting clipping thresholds and noise powers while adhering to a total privacy budget constraint. Extensive experiments on a variety of deep learning tasks, including image classification, natural language processing, and federated learning, demonstrate that the proposed dynamic DP-SGD algorithm stabilizes updates and, as a result, significantly improves model accuracy in the strong privacy protection region when compared to the vanilla DP-SGD. We also conduct theoretical analysis to better understand the privacy-utility trade-off with dynamic DP-SGD, as well as to learn why Dynamic DP-SGD can outperform vanilla DP-SGD.
    Uniform Sampling over Episode Difficulty. (arXiv:2108.01662v2 [cs.LG] UPDATED)
    (2 min) Episodic training is a core ingredient of few-shot learning to train models on tasks with limited labelled data. Despite its success, episodic training remains largely understudied, prompting us to ask the question: what is the best way to sample episodes? In this paper, we first propose a method to approximate episode sampling distributions based on their difficulty. Building on this method, we perform an extensive analysis and find that sampling uniformly over episode difficulty outperforms other sampling schemes, including curriculum and easy-/hard-mining. As the proposed sampling method is algorithm agnostic, we can leverage these insights to improve few-shot learning accuracies across many episodic training algorithms. We demonstrate the efficacy of our method across popular few-shot learning datasets, algorithms, network architectures, and protocols.
    Perspective Transformation Layer. (arXiv:2201.05706v1 [cs.CV])
    (2 min) Incorporating geometric transformations that reflect the relative position changes between an observer and an object into computer vision and deep learning models has attracted much attention in recent years. However, the existing proposals mainly focus on affine transformations that cannot fully show viewpoint changes. Furthermore, current solutions often apply a neural network module to learn a single transformation matrix, which ignores the possibility for various viewpoints and creates extra to-be-trained module parameters. In this paper, a layer (PT layer) is proposed to learn the perspective transformations that not only model the geometries in affine transformation but also reflect the viewpoint changes. In addition, being able to be directly trained with gradient descent like traditional layers such as convolutional layers, a single proposed PT layer can learn an adjustable number of multiple viewpoints without training extra module parameters. The experiments and evaluations confirm the superiority of the proposed PT layer.
    Deciding Not To Decide. (arXiv:2201.05818v1 [cs.AI])
    (2 min) Sometimes unexpected, novel, unconceivable events enter our lives. The cause-effect mappings that usually guide our behaviour are destroyed. Surprised and shocked by possibilities that we had never imagined, we are unable to make any decision beyond mere routine. Among them there are decisions, such as making investments, that are essential for the long-term survival of businesses as well as the economy at large. We submit that the standard machinery of utility maximization does not apply, but we propose measures inspired by scenario planning and graph analysis, pointing to solutions being explored in machine learning.
    Profitable Strategy Design by Using Deep Reinforcement Learning for Trades on Cryptocurrency Markets. (arXiv:2201.05906v1 [q-fin.TR])
    (2 min) Deep Reinforcement Learning solutions have been applied to different control problems with outperforming and promising results. In this research work we have applied Proximal Policy Optimization, Soft Actor-Critic and Generative Adversarial Imitation Learning to strategy design problem of three cryptocurrency markets. Our input data includes price data and technical indicators. We have implemented a Gym environment based on cryptocurrency markets to be used with the algorithms. Our test results on unseen data shows a great potential for this approach in helping investors with an expert system to exploit the market and gain profit. Our highest gain for an unseen 66 day span is 4850 US dollars per 10000 US dollars investment. We also discuss on how a specific hyperparameter in the environment design can be used to adjust risk in the generated strategies.
    A flow-based IDS using Machine Learning in eBPF. (arXiv:2102.09980v2 [cs.CR] UPDATED)
    (2 min) eBPF is a new technology which allows dynamically loading pieces of code into the Linux kernel. It can greatly speed up networking since it enables the kernel to process certain packets without the involvement of a userspace program. So far eBPF has been used for simple packet filtering applications such as firewalls or Denial of Service protection. We show that it is possible to develop a flow based network intrusion detection system based on machine learning entirely in eBPF. Our solution uses a decision tree and decides for each packet whether it is malicious or not, considering the entire previous context of the network flow. We achieve a performance increase of over 20% compared to the same solution implemented as a userspace program.
    Instant Neural Graphics Primitives with a Multiresolution Hash Encoding. (arXiv:2201.05989v1 [cs.CV])
    (2 min) Neural graphics primitives, parameterized by fully connected neural networks, can be costly to train and evaluate. We reduce this cost with a versatile new input encoding that permits the use of a smaller network without sacrificing quality, thus significantly reducing the number of floating point and memory access operations: a small neural network is augmented by a multiresolution hash table of trainable feature vectors whose values are optimized through stochastic gradient descent. The multiresolution structure allows the network to disambiguate hash collisions, making for a simple architecture that is trivial to parallelize on modern GPUs. We leverage this parallelism by implementing the whole system using fully-fused CUDA kernels with a focus on minimizing wasted bandwidth and compute operations. We achieve a combined speedup of several orders of magnitude, enabling training of high-quality neural graphics primitives in a matter of seconds, and rendering in tens of milliseconds at a resolution of ${1920\!\times\!1080}$.
    Universal Online Learning: an Optimistically Universal Learning Rule. (arXiv:2201.05947v1 [cs.LG])
    (2 min) We study the subject of universal online learning with non-i.i.d. processes for bounded losses. The notion of an universally consistent learning was defined by Hanneke in an effort to study learning theory under minimal assumptions, where the objective is to obtain low long-run average loss for any target function. We are interested in characterizing processes for which learning is possible and whether there exist learning rules guaranteed to be universally consistent given the only assumption that such learning is possible. The case of unbounded losses is very restrictive, since the learnable processes almost surely visit a finite number of points and as a result, simple memorization is optimistically universal. We focus on the bounded setting and give a complete characterization of the processes admitting strong and weak universal learning. We further show that k-nearest neighbor algorithm (kNN) is not optimistically universal and present a novel variant of 1NN which is optimistically universal for general input and value spaces in both strong and weak setting. This closes all COLT 2021 open problems posed by Hanneke on universal online learning.
    VAC-CNN: A Visual Analytics System for Comparative Studies of Deep Convolutional Neural Networks. (arXiv:2110.13252v2 [cs.LG] UPDATED)
    (2 min) The rapid development of Convolutional Neural Networks (CNNs) in recent years has triggered significant breakthroughs in many machine learning (ML) applications. The ability to understand and compare various CNN models available is thus essential. The conventional approach with visualizing each model's quantitative features, such as classification accuracy and computational complexity, is not sufficient for a deeper understanding and comparison of the behaviors of different models. Moreover, most of the existing tools for assessing CNN behaviors only support comparison between two models and lack the flexibility of customizing the analysis tasks according to user needs. This paper presents a visual analytics system, VAC-CNN (Visual Analytics for Comparing CNNs), that supports the in-depth inspection of a single CNN model as well as comparative studies of two or more models. The ability to compare a larger number of (e.g., tens of) models especially distinguishes our system from previous ones. With a carefully designed model visualization and explaining support, VAC-CNN facilitates a highly interactive workflow that promptly presents both quantitative and qualitative information at each analysis stage. We demonstrate VAC-CNN's effectiveness for assisting novice ML practitioners in evaluating and comparing multiple CNN models through two use cases and one preliminary evaluation study using the image classification tasks on the ImageNet dataset.
    Active Learning for Human-in-the-Loop Customs Inspection. (arXiv:2010.14282v2 [cs.LG] UPDATED)
    (2 min) We study the human-in-the-loop customs inspection scenario, where an AI-assisted algorithm supports customs officers by recommending a set of imported goods to be inspected. If the inspected items are fraudulent, the officers can levy extra duties. Th formed logs are then used as additional training data for successive iterations. Choosing to inspect suspicious items first leads to an immediate gain in customs revenue, yet such inspections may not bring new insights for learning dynamic traffic patterns. On the other hand, inspecting uncertain items can help acquire new knowledge, which will be used as a supplementary training resource to update the selection systems. Based on multiyear customs datasets obtained from three countries, we demonstrate that some degree of exploration is necessary to cope with domain shifts in trade data. The results show that a hybrid strategy of selecting likely fraudulent and uncertain items will eventually outperform the exploitation-only strategy.
    Generating synthetic transactional profiles. (arXiv:2111.01531v2 [cs.LG] UPDATED)
    (2 min) Financial institutions use clients' payment transactions in numerous banking applications. Transactions are very personal and rich in behavioural patterns, often unique to individuals, which make them equivalent to personally identifiable information in some cases. In this paper, we generate synthetic transactional profiles using machine learning techniques with the goal to preserve both data utility and privacy. A challenge we faced was to deal with sparse vectors due to the few spending categories a client uses compared to all the ones available. We measured data utility by calculating common insights used by the banking industry on both the original and the synthetic data-set. Our approach shows that neural network models can generate valuable synthetic data in such context. Finally, we tried privacy-preserving techniques and observed its effect on models' performances.
    First steps on Gamification of Lung Fluid Cells Annotations in the Flower Domain. (arXiv:2111.03663v2 [eess.IV] UPDATED)
    (2 min) Annotating data, especially in the medical domain, requires expert knowledge and a lot of effort. This limits the amount and/or usefulness of available medical data sets for experimentation. Therefore, developing strategies to increase the number of annotations while lowering the needed domain knowledge is of interest. A possible strategy is the use of gamification, i.e. transforming the annotation task into a game. We propose an approach to gamify the task of annotating lung fluid cells from pathological whole slide images (WSIs). As the domain is unknown to non-expert annotators, we transform images of cells to the domain of flower images using a CycleGAN architecture. In this more assessable domain, non-expert annotators can be (t)asked to annotate different kinds of flowers in a playful setting. In order to provide a proof of concept, this work shows that the domain transfer is possible by evaluating an image classification network trained on real cell images and tested on the cell images generated by the CycleGAN network (reconstructed cell images) as well as real cell images. The classification network reaches an average accuracy of 94.73 % on the original lung fluid cells and 95.25 % on the transformed lung fluid cells, respectively. Our study lays the foundation for future research on gamification using CycleGANs.
    CLUE: Contextualised Unified Explainable Learning of User Engagement in Video Lectures. (arXiv:2201.05651v1 [cs.AI])
    (2 min) Predicting contextualised engagement in videos is a long-standing problem that has been popularly attempted by exploiting the number of views or the associated likes using different computational methods. The recent decade has seen a boom in online learning resources, and during the pandemic, there has been an exponential rise of online teaching videos without much quality control. The quality of the content could be improved if the creators could get constructive feedback on their content. Employing an army of domain expert volunteers to provide feedback on the videos might not scale. As a result, there has been a steep rise in developing computational methods to predict a user engagement score that is indicative of some form of possible user engagement, i.e., to what level a user would tend to engage with the content. A drawback in current methods is that they model various features separately, in a cascaded approach, that is prone to error propagation. Besides, most of them do not provide crucial explanations on how the creator could improve their content. In this paper, we have proposed a new unified model, CLUE for the educational domain, which learns from the features extracted from freely available public online teaching videos and provides explainable feedback on the video along with a user engagement score. Given the complexity of the task, our unified framework employs different pre-trained models working together as an ensemble of classifiers. Our model exploits various multi-modal features to model the complexity of language, context agnostic information, textual emotion of the delivered content, animation, speaker's pitch and speech emotions. Under a transfer learning setup, the overall model, in the unified space, is fine-tuned for downstream applications.
    Identifying optimal cycles in quantum thermal machines with reinforcement-learning. (arXiv:2108.13525v2 [quant-ph] UPDATED)
    (2 min) The optimal control of open quantum systems is a challenging task but has a key role in improving existing quantum information processing technologies. We introduce a general framework based on Reinforcement Learning to discover optimal thermodynamic cycles that maximize the power of out-of-equilibrium quantum heat engines and refrigerators. We apply our method, based on the soft actor-critic algorithm, to three systems: a benchmark two-level system heat engine, where we find the optimal known cycle; an experimentally realistic refrigerator based on a superconducting qubit that generates coherence, where we find a non-intuitive control sequence that outperform previous cycles proposed in literature; a heat engine based on a quantum harmonic oscillator, where we find a cycle with an elaborate structure that outperforms the optimized Otto cycle. We then evaluate the corresponding efficiency at maximum power.
    Sequence Parallelism: Long Sequence Training from System Perspective. (arXiv:2105.13120v2 [cs.LG] UPDATED)
    (2 min) Self-attention suffers from quadratic memory requirements with respect to the sequence length. In this work, we propose sequence parallelism, a memory-efficient parallelism method to help us break input sequence length limitation and train with longer sequences on GPUs efficiently. Our approach is compatible with most existing parallelisms. More importantly, we no longer require a single device to hold the whole sequence. Specifically, we split the input sequence into multiple chunks and feed each chunk into its corresponding device (i.e. GPU). To compute the attention output, we integrated ring-style communication with self-attention calculation and proposed Ring Self-Attention (RSA). Experiments show that sequence parallelism performs well when scaling with batch size and sequence length. Compared with tensor parallelism, our approach achieved $13.7\times$ and $3.0\times$ maximum batch size and sequence length respectively when scaling up to 64 NVIDIA P100 GPUs.
    Improving Generalization in Meta-RL with Imaginary Tasks from Latent Dynamics Mixture. (arXiv:2105.13524v2 [cs.LG] UPDATED)
    (0 min) The generalization ability of most meta-reinforcement learning (meta-RL) methods is largely limited to test tasks that are sampled from the same distribution used to sample training tasks. To overcome the limitation, we propose Latent Dynamics Mixture (LDM) that trains a reinforcement learning agent with imaginary tasks generated from mixtures of learned latent dynamics. By training a policy on mixture tasks along with original training tasks, LDM allows the agent to prepare for unseen test tasks during training and prevents the agent from overfitting the training tasks. LDM significantly outperforms standard meta-RL methods in test returns on the gridworld navigation and MuJoCo tasks where we strictly separate the training task distribution and the test task distribution.
    Ensembling Off-the-shelf Models for GAN Training. (arXiv:2112.09130v2 [cs.CV] UPDATED)
    (0 min) The advent of large-scale training has produced a cornucopia of powerful visual recognition models. However, generative models, such as GANs, have traditionally been trained from scratch in an unsupervised manner. Can the collective "knowledge" from a large bank of pretrained vision models be leveraged to improve GAN training? If so, with so many models to choose from, which one(s) should be selected, and in what manner are they most effective? We find that pretrained computer vision models can significantly improve performance when used in an ensemble of discriminators. Notably, the particular subset of selected models greatly affects performance. We propose an effective selection mechanism, by probing the linear separability between real and fake samples in pretrained model embeddings, choosing the most accurate model, and progressively adding it to the discriminator ensemble. Interestingly, our method can improve GAN training in both limited data and large-scale settings. Given only 10k training samples, our FID on LSUN Cat matches the StyleGAN2 trained on 1.6M images. On the full dataset, our method improves FID by 1.5x to 2x on cat, church, and horse categories of LSUN.
    Risk-Monotonicity in Statistical Learning. (arXiv:2011.14126v5 [cs.LG] UPDATED)
    (0 min) Acquisition of data is a difficult task in many applications of machine learning, and it is only natural that one hopes and expects the population risk to decrease (better performance) monotonically with increasing data points. It turns out, somewhat surprisingly, that this is not the case even for the most standard algorithms that minimize the empirical risk. Non-monotonic behavior of the risk and instability in training have manifested and appeared in the popular deep learning paradigm under the description of double descent. These problems highlight the current lack of understanding of learning algorithms and generalization. It is, therefore, crucial to pursue this concern and provide a characterization of such behavior. In this paper, we derive the first consistent and risk-monotonic (in high probability) algorithms for a general statistical learning setting under weak assumptions, consequently answering some questions posed by Viering et al. 2019 on how to avoid non-monotonic behavior of risk curves. We further show that risk monotonicity need not necessarily come at the price of worse excess risk rates. To achieve this, we derive new empirical Bernstein-like concentration inequalities of independent interest that hold for certain non-i.i.d.~processes such as Martingale Difference Sequences.
    Learning Dynamical Systems with Side Information. (arXiv:2008.10135v2 [math.OC] UPDATED)
    (2 min) We present a mathematical and computational framework for the problem of learning a dynamical system from noisy observations of a few trajectories and subject to side information. Side information is any knowledge we might have about the dynamical system we would like to learn besides trajectory data. It is typically inferred from domain-specific knowledge or basic principles of a scientific discipline. We are interested in explicitly integrating side information into the learning process in order to compensate for scarcity of trajectory observations. We identify six types of side information that arise naturally in many applications and lead to convex constraints in the learning problem. First, we show that when our model for the unknown dynamical system is parameterized as a polynomial, one can impose our side information constraints computationally via semidefinite programming. We then demonstrate the added value of side information for learning the dynamics of basic models in physics and cell biology, as well as for learning and controlling the dynamics of a model in epidemiology. Finally, we study how well polynomial dynamical systems can approximate continuously-differentiable ones while satisfying side information (either exactly or approximately). Our overall learning methodology combines ideas from convex optimization, real algebra, dynamical systems, and functional approximation theory, and can potentially lead to new synergies between these areas.
    CLIP-TD: CLIP Targeted Distillation for Vision-Language Tasks. (arXiv:2201.05729v1 [cs.CV])
    (0 min) Contrastive language-image pretraining (CLIP) links vision and language modalities into a unified embedding space, yielding the tremendous potential for vision-language (VL) tasks. While early concurrent works have begun to study this potential on a subset of tasks, important questions remain: 1) What is the benefit of CLIP on unstudied VL tasks? 2) Does CLIP provide benefit in low-shot or domain-shifted scenarios? 3) Can CLIP improve existing approaches without impacting inference or pretraining complexity? In this work, we seek to answer these questions through two key contributions. First, we introduce an evaluation protocol that includes Visual Commonsense Reasoning (VCR), Visual Entailment (SNLI-VE), and Visual Question Answering (VQA), across a variety of data availability constraints and conditions of domain shift. Second, we propose an approach, named CLIP Targeted Distillation (CLIP-TD), to intelligently distill knowledge from CLIP into existing architectures using a dynamically weighted objective applied to adaptively selected tokens per instance. Experiments demonstrate that our proposed CLIP-TD leads to exceptional gains in the low-shot (up to 51.9%) and domain-shifted (up to 71.3%) conditions of VCR, while simultaneously improving performance under standard fully-supervised conditions (up to 2%), achieving state-of-art performance on VCR compared to other single models that are pretrained with image-text data only. On SNLI-VE, CLIP-TD produces significant gains in low-shot conditions (up to 6.6%) as well as fully supervised (up to 3%). On VQA, CLIP-TD provides improvement in low-shot (up to 9%), and in fully-supervised (up to 1.3%). Finally, CLIP-TD outperforms concurrent works utilizing CLIP for finetuning, as well as baseline naive distillation approaches. Code will be made available.
    Quantum Ensemble for Classification. (arXiv:2007.01028v3 [cs.LG] UPDATED)
    (0 min) A powerful way to improve performance in machine learning is to construct an ensemble that combines the predictions of multiple models. Ensemble methods are often much more accurate and lower variance than the individual classifiers that make them up but have high requirements in terms of memory and computational time. In fact, a large number of alternative algorithms is usually adopted, each requiring to query all available data. We propose a new quantum algorithm that exploits quantum superposition, entanglement and interference to build an ensemble of classification models. Thanks to the generation of the several quantum trajectories in superposition, we obtain $B$ transformations of the quantum state which encodes the training set in only $log\left(B\right)$ operations. This implies exponential growth of the ensemble size while increasing linearly the depth of the correspondent circuit. Furthermore, when considering the overall cost of the algorithm, we show that the training of a single weak classifier impacts additively the overall time complexity rather than multiplicatively, as it usually happens in classical ensemble methods. We also present small-scale experiments on real-world datasets, defining a quantum version of the cosine classifier and using the IBM qiskit environment to show how the algorithms work.
    Recursive Bayesian Networks: Generalising and Unifying Probabilistic Context-Free Grammars and Dynamic Bayesian Networks. (arXiv:2111.01853v4 [cs.LG] UPDATED)
    (0 min) Probabilistic context-free grammars (PCFGs) and dynamic Bayesian networks (DBNs) are widely used sequence models with complementary strengths and limitations. While PCFGs allow for nested hierarchical dependencies (tree structures), their latent variables (non-terminal symbols) have to be discrete. In contrast, DBNs allow for continuous latent variables, but the dependencies are strictly sequential (chain structure). Therefore, neither can be applied if the latent variables are assumed to be continuous and also to have a nested hierarchical dependency structure. In this paper, we present Recursive Bayesian Networks (RBNs), which generalise and unify PCFGs and DBNs, combining their strengths and containing both as special cases. RBNs define a joint distribution over tree-structured Bayesian networks with discrete or continuous latent variables. The main challenge lies in performing joint inference over the exponential number of possible structures and the continuous variables. We provide two solutions: 1) For arbitrary RBNs, we generalise inside and outside probabilities from PCFGs to the mixed discrete-continuous case, which allows for maximum posterior estimates of the continuous latent variables via gradient descent, while marginalising over network structures. 2) For Gaussian RBNs, we additionally derive an analytic approximation, allowing for robust parameter optimisation and Bayesian inference. The capacity and diverse applications of RBNs are illustrated on two examples: In a quantitative evaluation on synthetic data, we demonstrate and discuss the advantage of RBNs for segmentation and tree induction from noisy sequences, compared to change point detection and hierarchical clustering. In an application to musical data, we approach the unsolved problem of hierarchical music analysis from the raw note level and compare our results to expert annotations.
    Embodied Self-supervised Learning by Coordinated Sampling and Training. (arXiv:2006.13350v2 [eess.AS] UPDATED)
    (0 min) Self-supervised learning can significantly improve the performance of downstream tasks, however, the dimensions of learned representations normally lack explicit physical meanings. In this work, we propose a novel self-supervised approach to solve inverse problems by employing the corresponding physical forward process so that the learned representations can have explicit physical meanings. The proposed approach works in an analysis-by-synthesis manner to learn an inference network by iteratively sampling and training. At the sampling step, given observed data, the inference network is used to approximate the intractable posterior, from which we sample input parameters and feed them to a physical process to generate data in the observational space; At the training step, the same network is optimized with the sampled paired data. We prove the feasibility of the proposed method by tackling the acoustic-to-articulatory inversion problem to infer articulatory information from speech. Given an articulatory synthesizer, an inference model can be trained completely from scratch with random initialization. Our experiments demonstrate that the proposed method can converge steadily and the network learns to control the articulatory synthesizer to speak like a human. We also demonstrate that trained models can generalize well to unseen speakers or even new languages, and performance can be further improved through self-adaptation.
    SimCVD: Simple Contrastive Voxel-Wise Representation Distillation for Semi-Supervised Medical Image Segmentation. (arXiv:2108.06227v3 [cs.CV] UPDATED)
    (0 min) Automated segmentation in medical image analysis is a challenging task that requires a large amount of manually labeled data. However, most existing learning-based approaches usually suffer from limited manually annotated medical data, which poses a major practical problem for accurate and robust medical image segmentation. In addition, most existing semi-supervised approaches are usually not robust compared with the supervised counterparts, and also lack explicit modeling of geometric structure and semantic information, both of which limit the segmentation accuracy. In this work, we present SimCVD, a simple contrastive distillation framework that significantly advances state-of-the-art voxel-wise representation learning. We first describe an unsupervised training strategy, which takes two views of an input volume and predicts their signed distance maps of object boundaries in a contrastive objective, with only two independent dropout as mask. This simple approach works surprisingly well, performing on the same level as previous fully supervised methods with much less labeled data. We hypothesize that dropout can be viewed as a minimal form of data augmentation and makes the network robust to representation collapse. Then, we propose to perform structural distillation by distilling pair-wise similarities. We evaluate SimCVD on two popular datasets: the Left Atrial Segmentation Challenge (LA) and the NIH pancreas CT dataset. The results on the LA dataset demonstrate that, in two types of labeled ratios (i.e., 20% and 10%), SimCVD achieves an average Dice score of 90.85% and 89.03% respectively, a 0.91% and 2.22% improvement compared to previous best results. Our method can be trained in an end-to-end fashion, showing the promise of utilizing SimCVD as a general framework for downstream tasks, such as medical image synthesis, enhancement, and registration.
    Explaining and Measuring Functionalities of Malware Detectors. (arXiv:2111.10085v2 [cs.CR] UPDATED)
    (0 min) Numerous open-source and commercial malware detectors are available. However, their efficacy is threatened by new adversarial attacks, whereby malware attempts to evade detection, e.g., by performing feature-space manipulation. In this work, we propose an explainability-guided and model-agnostic framework for measuring the ability of malware to evade detection. The framework introduces the concept of Accrued Malicious Magnitude (AMM) to identify which malware features should be manipulated to maximize the likelihood of evading detection. We then use this framework to test several state-of-the-art malware detectors ability to detect manipulated malware. We find that (i) commercial antivirus engines are vulnerable to AMM-guided manipulated samples; (ii) the ability of a manipulated malware generated using one detector to evade detection by another detector (i.e., transferability) depends on the overlap of features with large AMM values between the different detectors; and (iii) AMM values effectively measure the importance of features and explain the ability to evade detection. Our findings shed light on the weaknesses of current malware detectors, as well as how they can be improved.
    ResNEsts and DenseNEsts: Block-based DNN Models with Improved Representation Guarantees. (arXiv:2111.05496v2 [cs.LG] UPDATED)
    (0 min) Models recently used in the literature proving residual networks (ResNets) are better than linear predictors are actually different from standard ResNets that have been widely used in computer vision. In addition to the assumptions such as scalar-valued output or single residual block, these models have no nonlinearities at the final residual representation that feeds into the final affine layer. To codify such a difference in nonlinearities and reveal a linear estimation property, we define ResNEsts, i.e., Residual Nonlinear Estimators, by simply dropping nonlinearities at the last residual representation from standard ResNets. We show that wide ResNEsts with bottleneck blocks can always guarantee a very desirable training property that standard ResNets aim to achieve, i.e., adding more blocks does not decrease performance given the same set of basis elements. To prove that, we first recognize ResNEsts are basis function models that are limited by a coupling problem in basis learning and linear prediction. Then, to decouple prediction weights from basis learning, we construct a special architecture termed augmented ResNEst (A-ResNEst) that always guarantees no worse performance with the addition of a block. As a result, such an A-ResNEst establishes empirical risk lower bounds for a ResNEst using corresponding bases. Our results demonstrate ResNEsts indeed have a problem of diminishing feature reuse; however, it can be avoided by sufficiently expanding or widening the input space, leading to the above-mentioned desirable property. Inspired by the DenseNets that have been shown to outperform ResNets, we also propose a corresponding new model called Densely connected Nonlinear Estimator (DenseNEst). We show that any DenseNEst can be represented as a wide ResNEst with bottleneck blocks. Unlike ResNEsts, DenseNEsts exhibit the desirable property without any special architectural re-design.
    Common Phone: A Multilingual Dataset for Robust Acoustic Modelling. (arXiv:2201.05912v1 [eess.AS])
    (0 min) Current state of the art acoustic models can easily comprise more than 100 million parameters. This growing complexity demands larger training datasets to maintain a decent generalization of the final decision function. An ideal dataset is not necessarily large in size, but large with respect to the amount of unique speakers, utilized hardware and varying recording conditions. This enables a machine learning model to explore as much of the domain-specific input space as possible during parameter estimation. This work introduces Common Phone, a gender-balanced, multilingual corpus recorded from more than 76.000 contributors via Mozilla's Common Voice project. It comprises around 116 hours of speech enriched with automatically generated phonetic segmentation. A Wav2Vec 2.0 acoustic model was trained with the Common Phone to perform phonetic symbol recognition and validate the quality of the generated phonetic annotation. The architecture achieved a PER of 18.1 % on the entire test set, computed with all 101 unique phonetic symbols, showing slight differences between the individual languages. We conclude that Common Phone provides sufficient variability and reliable phonetic annotation to help bridging the gap between research and application of acoustic models.
    Fast and Accurate Optical Fiber Channel Modeling Using Generative Adversarial Network. (arXiv:2002.12648v2 [cs.IT] UPDATED)
    (0 min) In this work, a new data-driven fiber channel modeling method, generative adversarial network (GAN) is investigated to learn the distribution of fiber channel transfer function. Our investigation focuses on joint channel effects of attenuation, chromic dispersion, self-phase modulation (SPM), and amplified spontaneous emission (ASE) noise. To achieve the success of GAN for channel modeling, we modify the loss function, design the condition vector of input and address the mode collapse for the long-haul transmission. The effective architecture, parameters, and training skills of GAN are also displayed in the paper. The results show that the proposed method can learn the accurate transfer function of the fiber channel. The transmission distance of modeling can be up to 1000 km and can be extended to arbitrary distance theoretically. Moreover, GAN shows robust generalization abilities under different optical launch powers, modulation formats, and input signal distributions. Comparing the complexity of GAN with the split-step Fourier method (SSFM), the total multiplication number is only 2% of SSFM and the running time is less than 0.1 seconds for 1000-km transmission, versus 400 seconds using the SSFM under the same hardware and software conditions, which highlights the remarkable reduction in complexity of the fiber channel modeling.
    Self-supervised Representation Learning of Neuronal Morphologies. (arXiv:2112.12482v2 [stat.ML] UPDATED)
    (0 min) Understanding the diversity of cell types and their function in the brain is one of the key challenges in neuroscience. The advent of large-scale datasets has given rise to the need of unbiased and quantitative approaches to cell type classification. We present GraphDINO, a purely data-driven approach to learning a low dimensional representation of the 3D morphology of neurons. GraphDINO is a novel graph representation learning method for spatial graphs utilizing self-supervised learning on transformer models. It combines attention-based global interaction between nodes and classic graph convolutional processing. We show, in two different species and cortical areas, that this method is able to yield morphological cell type clustering that is comparable to manual feature-based classification and shows a good correspondence to expert-labeled cell types. Our method is applicable beyond neuroscience in settings where samples in a dataset are graphs and graph-level embeddings are desired.
    Optimizing Area Under the Curve Measures via Matrix Factorization for Predicting Drug-Target Interaction with Multiple Similarities. (arXiv:2105.01545v2 [q-bio.QM] UPDATED)
    (0 min) In drug discovery, identifying drug-target interactions (DTIs) via experimental approaches is a tedious and expensive procedure. Computational methods efficiently predict DTIs and recommend a small part of potential interacting pairs for further experimental confirmation, accelerating the drug discovery process. Although it has been shown that fusing heterogeneous drug and target similarities can improve the prediction ability, the existing similarity combination methods ignore the interaction consistency for neighbour entities which is more crucial for the DTI prediction model. Furthermore, area under the precision-recall curve (AUPR) that emphasizes the accuracy of top-ranked pairs and area under the receiver operating characteristic curve (AUC) that heavily punishes the existence of low ranked interacting pairs are two widely used evaluation metrics in DTI prediction. However, the two metrics are seldom considered as losses within existing DTI prediction methods. This paper first proposes two matrix factorization (MF) methods that optimize AUPR and AUC using convex surrogate losses respectively, and then develops an ensemble MF approach takes advantage of the two area under the curve metrics by combining the two single metric based MF models. Both three proposed approaches incorporate a novel local interaction consistency aware similarity interaction method to generate fused drug and target similarities that preserve vital information from the more reliable view. Experimental results over five datasets under different prediction settings show that the proposed methods outperform various competitors in terms of the metric(s) they optimize. In addition, the validation on the top ranked novel predictions confirms the ability of our methods to discover potential new DTIs.
    A Law of Iterated Logarithm for Multi-Agent Reinforcement Learning. (arXiv:2110.15092v3 [cs.LG] UPDATED)
    (0 min) In Multi-Agent Reinforcement Learning (MARL), multiple agents interact with a common environment, as also with each other, for solving a shared problem in sequential decision-making. It has wide-ranging applications in gaming, robotics, finance, etc. In this work, we derive a novel law of iterated logarithm for a family of distributed nonlinear stochastic approximation schemes that is useful in MARL. In particular, our result describes the convergence rate on almost every sample path where the algorithm converges. This result is the first of its kind in the distributed setup and provides deeper insights than the existing ones, which only discuss convergence rates in the expected or the CLT sense. Importantly, our result holds under significantly weaker assumptions: neither the gossip matrix needs to be doubly stochastic nor the stepsizes square summable. As an application, we show that, for the stepsize $n^{-\gamma}$ with $\gamma \in (0, 1),$ the distributed TD(0) algorithm with linear function approximation has a convergence rate of $O(\sqrt{n^{-\gamma} \ln n })$ a.s.; for the $1/n$ type stepsize, the same is $O(\sqrt{n^{-1} \ln \ln n})$ a.s. These decay rates do not depend on the graph depicting the interactions among the different agents.
    Unifying Heterogeneous Electronic Health Records Systems via Text-Based Code Embedding. (arXiv:2111.09098v4 [cs.CL] UPDATED)
    (0 min) EHR systems lack a unified code system forrepresenting medical concepts, which acts asa barrier for the deployment of deep learningmodels in large scale to multiple clinics and hos-pitals. To overcome this problem, we introduceDescription-based Embedding,DescEmb, a code-agnostic representation learning framework forEHR. DescEmb takes advantage of the flexibil-ity of neural language understanding models toembed clinical events using their textual descrip-tions rather than directly mapping each event toa dedicated embedding. DescEmb outperformedtraditional code-based embedding in extensiveexperiments, especially in a zero-shot transfertask (one hospital to another), and was able totrain a single unified model for heterogeneousEHR datasets.
    Disentanglement enables cross-domain Hippocampus Segmentation. (arXiv:2201.05650v1 [eess.IV])
    (0 min) Limited amount of labelled training data are a common problem in medical imaging. This makes it difficult to train a well-generalised model and therefore often leads to failure in unknown domains. Hippocampus segmentation from magnetic resonance imaging (MRI) scans is critical for the diagnosis and treatment of neuropsychatric disorders. Domain differences in contrast or shape can significantly affect segmentation. We address this issue by disentangling a T1-weighted MRI image into its content and domain. This separation enables us to perform a domain transfer and thus convert data from new sources into the training domain. This step thus simplifies the segmentation problem, resulting in higher quality segmentations. We achieve the disentanglement with the proposed novel methodology 'Content Domain Disentanglement GAN', and we propose to retrain the UNet on the transformed outputs to deal with GAN-specific artefacts. With these changes, we are able to improve performance on unseen domains by 6-13% and outperform state-of-the-art domain transfer methods.
    Stochastic Multi-Armed Bandits with Control Variates. (arXiv:2105.03962v3 [cs.LG] UPDATED)
    (0 min) This paper studies a new variant of the stochastic multi-armed bandits problem where auxiliary information about the arm rewards is available in the form of control variates. In many applications like queuing and wireless networks, the arm rewards are functions of some exogenous variables. The mean values of these variables are known a priori from historical data and can be used as control variates. Leveraging the theory of control variates, we obtain mean estimates with smaller variance and tighter confidence bounds. We develop an upper confidence bound based algorithm named UCB-CV and characterize the regret bounds in terms of the correlation between rewards and control variates when they follow a multivariate normal distribution. We also extend UCB-CV to other distributions using resampling methods like Jackknifing and Splitting. Experiments on synthetic problem instances validate performance guarantees of the proposed algorithms.
    Posterior Adaptation With New Priors. (arXiv:2007.01386v3 [cs.LG] UPDATED)
    (0 min) Classification approaches based on the direct estimation and analysis of posterior probabilities will degrade if the original class priors begin to change. We prove that a unique (up to scale) solution is possible to recover the data likelihoods for a test example from its original class posteriors and dataset priors. Given the recovered likelihoods and a set of new priors, the posteriors can be re-computed using Bayes' Rule to reflect the influence of the new priors. The method is simple to compute and allows a dynamic update of the original posteriors.
    Scientific Machine Learning through Physics-Informed Neural Networks: Where we are and What's next. (arXiv:2201.05624v1 [cs.LG])
    (0 min) Physic-Informed Neural Networks (PINN) are neural networks (NNs) that encode model equations, like Partial Differential Equations (PDE), as a component of the neural network itself. PINNs are nowadays used to solve PDEs, fractional equations, and integral-differential equations. This novel methodology has arisen as a multi-task learning framework in which a NN must fit observed data while reducing a PDE residual. This article provides a comprehensive review of the literature on PINNs: while the primary goal of the study was to characterize these networks and their related advantages and disadvantages, the review also attempts to incorporate publications on a larger variety of issues, including physics-constrained neural networks (PCNN), where the initial or boundary conditions are directly embedded in the NN structure rather than in the loss functions. The study indicates that most research has focused on customizing the PINN through different activation functions, gradient optimization techniques, neural network structures, and loss function structures. Despite the wide range of applications for which PINNs have been used, by demonstrating their ability to be more feasible in some contexts than classical numerical techniques like Finite Element Method (FEM), advancements are still possible, most notably theoretical issues that remain unresolved.
    Block Policy Mirror Descent. (arXiv:2201.05756v1 [cs.LG])
    (0 min) In this paper, we present a new class of policy gradient (PG) methods, namely the block policy mirror descent (BPMD) methods for solving a class of regularized reinforcement learning (RL) problems with (strongly) convex regularizers. Compared to the traditional PG methods with batch update rule, which visit and update the policy for every state, BPMD methods have cheap per-iteration computation via a partial update rule that performs the policy update on a sampled state. Despite the nonconvex nature of the problem and a partial update rule, BPMD methods achieve fast linear convergence to the global optimality. We further extend BPMD methods to the stochastic setting, by utilizing stochastic first-order information constructed from samples. We establish $\cO(1/\epsilon)$ (resp. $\cO(1/\epsilon^2)$) sample complexity for the strongly convex (resp. non-strongly convex) regularizers, with different procedures for constructing the stochastic first-order information, where $\epsilon$ denotes the target accuracy. To the best of our knowledge, this is the first time that block coordinate descent methods have been developed and analyzed for policy optimization in reinforcement learning.
    Automated causal inference in application to randomized controlled clinical trials. (arXiv:2201.05773v1 [stat.ME])
    (0 min) Randomized controlled trials (RCTs) are considered as the gold standard for testing causal hypotheses in the clinical domain. However, the investigation of prognostic variables of patient outcome in a hypothesized cause-effect route is not feasible using standard statistical methods. Here, we propose a new automated causal inference method (AutoCI) built upon the invariant causal prediction (ICP) framework for the causal re-interpretation of clinical trial data. Compared to existing methods, we show that the proposed AutoCI allows to efficiently determine the causal variables with a clear differentiation on two real-world RCTs of endometrial cancer patients with mature outcome and extensive clinicopathological and molecular data. This is achieved via suppressing the causal probability of non-causal variables by a wide margin. In ablation studies, we further demonstrate that the assignment of causal probabilities by AutoCI remain consistent in the presence of confounders. In conclusion, these results confirm the robustness and feasibility of AutoCI for future applications in real-world clinical analysis.
    Noncommutative Geometry of Computational Models and Uniformization for Framed Quiver Varieties. (arXiv:2201.05900v1 [math.AG])
    (0 min) We formulate a mathematical setup for computational neural networks using noncommutative algebras and near-rings, in motivation of quantum automata. We study the moduli space of the corresponding framed quiver representations, and find moduli of Euclidean and non-compact types in light of uniformization.
    Effective and Efficient Graph Learning for Multi-view Clustering. (arXiv:2108.06734v2 [cs.LG] UPDATED)
    (0 min) Despite the impressive clustering performance and efficiency in characterizing both the relationship between data and cluster structure, existing graph-based multi-view clustering methods still have the following drawbacks. They suffer from the expensive time burden due to both the construction of graphs and eigen-decomposition of Laplacian matrix, and fail to explore the cluster structure of large-scale data. Moreover, they require a post-processing to get the final clustering, resulting in suboptimal performance. Furthermore, rank of the learned view-consensus graph cannot approximate the target rank. In this paper, drawing the inspiration from the bipartite graph, we propose an effective and efficient graph learning model for multi-view clustering. Specifically, our method exploits the view-similar between graphs of different views by the minimization of tensor Schatten p-norm, which well characterizes both the spatial structure and complementary information embedded in graphs of different views. We learn view-consensus graph with adaptively weighted strategy and connectivity constraint such that the connected components indicates clusters directly. Our proposed algorithm is time-economical and obtains the stable results and scales well with the data size. Extensive experimental results indicate that our method is superior to state-of-the-art methods.
    Prediction of Drug-Induced TdP Risks Using Machine Learning and Rabbit Ventricular Wedge Assay. (arXiv:2201.05669v1 [q-bio.QM])
    (0 min) The evaluation of drug-induced Torsades de pointes (TdP) risks is crucial in drug safety assessment. In this study, we discuss machine learning approaches in the prediction of drug-induced TdP risks using preclinical data. Specifically, the random forest model was trained on the dataset generated by the rabbit ventricular wedge assay. The model prediction performance was measured on 28 drugs from the Comprehensive In Vitro Proarrhythmia Assay initiative. Leave-one-drug-out cross-validation provided an unbiased estimation of model performance. Stratified bootstrap revealed the uncertainty in the asymptotic model prediction. Our study validated the utility of machine learning approaches in predicting drug-induced TdP risks from preclinical data. Our methods can be extended to other preclinical protocols and serve as a supplementary evaluation in drug safety assessment.
    Unifying Heterogeneous Electronic Health Records Systems via Text-Based Code Embedding. (arXiv:2108.03625v2 [cs.LG] UPDATED)
    (0 min) Substantial increase in the use of Electronic Health Records (EHRs) has opened new frontiers for predictive healthcare. However, while EHR systems are nearly ubiquitous, they lack a unified code system for representing medical concepts. Heterogeneous formats of EHR present a substantial barrier for the training and deployment of state-of-the-art deep learning models at scale. To overcome this problem, we introduce Description-based Embedding, DescEmb, a code-agnostic description-based representation learning framework for predictive modeling on EHR. DescEmb takes advantage of the flexibility of neural language understanding models while maintaining a neutral approach that can be combined with prior frameworks for task-specific representation learning or predictive modeling. We tested our model's capacity on various experiments including prediction tasks, transfer learning and pooled learning. DescEmb shows higher performance in overall experiments compared to code-based approach, opening the door to a text-based approach in predictive healthcare research that is not constrained by EHR structure nor special domain knowledge.
    CAFE: Catastrophic Data Leakage in Vertical Federated Learning. (arXiv:2110.15122v3 [cs.LG] UPDATED)
    (0 min) Recent studies show that private training data can be leaked through the gradients sharing mechanism deployed in distributed machine learning systems, such as federated learning (FL). Increasing batch size to complicate data recovery is often viewed as a promising defense strategy against data leakage. In this paper, we revisit this defense premise and propose an advanced data leakage attack with theoretical justification to efficiently recover batch data from the shared aggregated gradients. We name our proposed method as catastrophic data leakage in vertical federated learning (CAFE). Comparing to existing data leakage attacks, our extensive experimental results on vertical FL settings demonstrate the effectiveness of CAFE to perform large-batch data leakage attack with improved data recovery quality. We also propose a practical countermeasure to mitigate CAFE. Our results suggest that private data participated in standard FL, especially the vertical case, have a high risk of being leaked from the training gradients. Our analysis implies unprecedented and practical data leakage risks in those learning settings. The code of our work is available at https://github.com/DeRafael/CAFE.
    On the Periodic Behavior of Neural Network Training with Batch Normalization and Weight Decay. (arXiv:2106.15739v3 [cs.LG] UPDATED)
    (2 min) Training neural networks with batch normalization and weight decay has become a common practice in recent years. In this work, we show that their combined use may result in a surprising periodic behavior of optimization dynamics: the training process regularly exhibits destabilizations that, however, do not lead to complete divergence but cause a new period of training. We rigorously investigate the mechanism underlying the discovered periodic behavior from both empirical and theoretical points of view and analyze the conditions in which it occurs in practice. We also demonstrate that periodic behavior can be regarded as a generalization of two previously opposing perspectives on training with batch normalization and weight decay, namely the equilibrium presumption and the instability presumption.
    StolenEncoder: Stealing Pre-trained Encoders. (arXiv:2201.05889v1 [cs.CR])
    (2 min) Pre-trained encoders are general-purpose feature extractors that can be used for many downstream tasks. Recent progress in self-supervised learning can pre-train highly effective encoders using a large volume of unlabeled data, leading to the emerging encoder as a service (EaaS). A pre-trained encoder may be deemed confidential because its training often requires lots of data and computation resources as well as its public release may facilitate misuse of AI, e.g., for deepfakes generation. In this paper, we propose the first attack called StolenEncoder to steal pre-trained image encoders. We evaluate StolenEncoder on multiple target encoders pre-trained by ourselves and three real-world target encoders including the ImageNet encoder pre-trained by Google, CLIP encoder pre-trained by OpenAI, and Clarifai's General Embedding encoder deployed as a paid EaaS. Our results show that the encoders stolen by StolenEncoder have similar functionality with the target encoders. In particular, the downstream classifiers built upon a target encoder and a stolen encoder have similar accuracy. Moreover, stealing a target encoder using StolenEncoder requires much less data and computation resources than pre-training it from scratch. We also explore three defenses that perturb feature vectors produced by a target encoder. Our evaluation shows that these defenses are not enough to mitigate StolenEncoder.
    Learn Quasi-stationary Distributions of Finite State Markov Chain. (arXiv:2111.11213v2 [cs.LG] UPDATED)
    (2 min) We propose a reinforcement learning (RL) approach to compute the expression of quasi-stationary distribution. Based on the fixed-point formulation of quasi-stationary distribution, we minimize the KL-divergence of two Markovian path distributions induced by the candidate distribution and the true target distribution. To solve this challenging minimization problem by gradient descent, we apply the reinforcement learning technique by introducing the reward and value functions. We derive the corresponding policy gradient theorem and design an actor-critic algorithm to learn the optimal solution and the value function. The numerical examples of finite state Markov chain are tested to demonstrate the new method.
    A Framework for Pedestrian Sub-classification and Arrival Time Prediction at Signalized Intersection Using Preprocessed Lidar Data. (arXiv:2201.05877v1 [cs.LG])
    (2 min) The mortality rate for pedestrians using wheelchairs was 36% higher than the overall population pedestrian mortality rate. However, there is no data to clarify the pedestrians' categories in both fatal and nonfatal accidents, since police reports often do not keep a record of whether a victim was using a wheelchair or has a disability. Currently, real-time detection of vulnerable road users using advanced traffic sensors installed at the infrastructure side has a great potential to significantly improve traffic safety at the intersection. In this research, we develop a systematic framework with a combination of machine learning and deep learning models to distinguish disabled people from normal walk pedestrians and predict the time needed to reach the next side of the intersection. The proposed framework shows high performance both at vulnerable user classification and arrival time prediction accuracy.
    Exposing the Obscured Influence of State-Controlled Media: A Causal Estimation of Influence Between Media Outlets Via Quotation Propagation. (arXiv:2201.05985v1 [cs.SI])
    (2 min) This study quantifies influence between media outlets by applying a novel methodology that uses causal effect estimation on networks and transformer language models. We demonstrate the obscured influence of state-controlled outlets over other outlets, regardless of orientation, by analyzing a large dataset of quotations from over 100 thousand articles published by the most prominent European and Russian traditional media outlets, appearing between May 2018 and October 2019. The analysis maps out the network structure of influence with news wire services serving as prominent bridges that connect outlets in different geo-political spheres. Overall, this approach demonstrates capabilities to identify and quantify the channels of influence in intermedia agenda setting over specific topics.
    Learning Multi-Objective Curricula for Deep Reinforcement Learning. (arXiv:2110.03032v2 [cs.LG] UPDATED)
    (2 min) Various automatic curriculum learning (ACL) methods have been proposed to improve the sample efficiency and final performance of deep reinforcement learning (DRL). They are designed to control how a DRL agent collects data, which is inspired by how humans gradually adapt their learning processes to their capabilities. For example, ACL can be used for subgoal generation, reward shaping, environment generation, or initial state generation. However, prior work only considers curriculum learning following one of the aforementioned predefined paradigms. It is unclear which of these paradigms are complementary, and how the combination of them can be learned from interactions with the environment. Therefore, in this paper, we propose a unified automatic curriculum learning framework to create multi-objective but coherent curricula that are generated by a set of parametric curriculum modules. Each curriculum module is instantiated as a neural network and is responsible for generating a particular curriculum. In order to coordinate those potentially conflicting modules in unified parameter space, we propose a multi-task hyper-net learning framework that uses a single hyper-net to parameterize all those curriculum modules. In addition to existing hand-designed curricula paradigms, we further design a flexible memory mechanism to learn an abstract curriculum, which may otherwise be difficult to design manually. We evaluate our method on a series of robotic manipulation tasks and demonstrate its superiority over other state-of-the-art ACL methods in terms of sample efficiency and final performance.
    An adaptive stochastic gradient-free approach for high-dimensional blackbox optimization. (arXiv:2006.10887v2 [math.OC] UPDATED)
    (2 min) In this work, we propose a novel adaptive stochastic gradient-free (ASGF) approach for solving high-dimensional nonconvex optimization problems based on function evaluations. We employ a directional Gaussian smoothing of the target function that generates a surrogate of the gradient and assists in avoiding bad local optima by utilizing nonlocal information of the loss landscape. Applying a deterministic quadrature scheme results in a massively scalable technique that is sample-efficient and achieves spectral accuracy. At each step we randomly generate the search directions while primarily following the surrogate of the smoothed gradient. This enables exploitation of the gradient direction while maintaining sufficient space exploration, and accelerates convergence towards the global extrema. In addition, we make use of a local approximation of the Lipschitz constant in order to adaptively adjust the values of all hyperparameters, thus removing the careful fine-tuning of current algorithms that is often necessary to be successful when applied to a large class of learning tasks. As such, the ASGF strategy offers significant improvements when solving high-dimensional nonconvex optimization problems when compared to other gradient-free methods (including the so called "evolutionary strategies") as well as iterative approaches that rely on the gradient information of the objective function. We illustrate the improved performance of this method by providing several comparative numerical studies on benchmark global optimization problems and reinforcement learning tasks.
    Towards Biologically Plausible Convolutional Networks. (arXiv:2106.13031v2 [cs.LG] UPDATED)
    (2 min) Convolutional networks are ubiquitous in deep learning. They are particularly useful for images, as they reduce the number of parameters, reduce training time, and increase accuracy. However, as a model of the brain they are seriously problematic, since they require weight sharing - something real neurons simply cannot do. Consequently, while neurons in the brain can be locally connected (one of the features of convolutional networks), they cannot be convolutional. Locally connected but non-convolutional networks, however, significantly underperform convolutional ones. This is troublesome for studies that use convolutional networks to explain activity in the visual system. Here we study plausible alternatives to weight sharing that aim at the same regularization principle, which is to make each neuron within a pool react similarly to identical inputs. The most natural way to do that is by showing the network multiple translations of the same image, akin to saccades in animal vision. However, this approach requires many translations, and doesn't remove the performance gap. We propose instead to add lateral connectivity to a locally connected network, and allow learning via Hebbian plasticity. This requires the network to pause occasionally for a sleep-like phase of "weight sharing". This method enables locally connected networks to achieve nearly convolutional performance on ImageNet and improves their fit to the ventral stream data, thus supporting convolutional networks as a model of the visual stream.
    Predictive Maintenance for General Aviation Using Convolutional Transformers. (arXiv:2110.03757v2 [cs.LG] UPDATED)
    (2 min) Predictive maintenance systems have the potential to significantly reduce costs for maintaining aircraft fleets as well as provide improved safety by detecting maintenance issues before they come severe. However, the development of such systems has been limited due to a lack of publicly labeled multivariate time series (MTS) sensor data. MTS classification has advanced greatly over the past decade, but there is a lack of sufficiently challenging benchmarks for new methods. This work introduces the NGAFID Maintenance Classification (NGAFID-MC) dataset as a novel benchmark in terms of difficulty, number of samples, and sequence length. NGAFID-MC consists of over 7,500 labeled flights, representing over 11,500 hours of per second flight data recorder readings of 23 sensor parameters. Using this benchmark, we demonstrate that Recurrent Neural Network (RNN) methods are not well suited for capturing temporally distant relationships and propose a new architecture called Convolutional Multiheaded Self Attention (Conv-MHSA) that achieves greater classification performance at greater computational efficiency. We also demonstrate that image inspired augmentations of cutout, mixup, and cutmix, can be used to reduce overfitting and improve generalization in MTS classification. Our best trained models have been incorporated back into the NGAFID to allow users to potentially detect flights that require maintenance as well as provide feedback to further expand and refine the NGAFID-MC dataset.
    Diffusion Tensor Estimation with Transformer Neural Networks. (arXiv:2201.05701v1 [eess.IV])
    (2 min) Diffusion tensor imaging (DTI) is the most widely used tool for studying brain white matter development and degeneration. However, standard DTI estimation methods depend on a large number of high-quality measurements. This would require long scan times and can be particularly difficult to achieve with certain patient populations such as neonates. Here, we propose a method that can accurately estimate the diffusion tensor from only six diffusion-weighted measurements. Our method achieves this by learning to exploit the relationships between the diffusion signals and tensors in neighboring voxels. Our model is based on transformer networks, which represent the state of the art in modeling the relationship between signals in a sequence. In particular, our model consists of two such networks. The first network estimates the diffusion tensor based on the diffusion signals in a neighborhood of voxels. The second network provides more accurate tensor estimations by learning the relationships between the diffusion signals as well as the tensors estimated by the first network in neighboring voxels. Our experiments with three datasets show that our proposed method achieves highly accurate estimations of the diffusion tensor and is significantly superior to three competing methods. Estimations produced by our method with six measurements are comparable with those of standard estimation methods with 30-88 measurements. Hence, our method promises shorter scan times and more reliable assessment of brain white matter, particularly in non-cooperative patients such as neonates and infants.
    Inference via Sparse Coding in a Hierarchical Vision Model. (arXiv:2108.01548v3 [cs.CV] UPDATED)
    (2 min) Sparse coding has been incorporated in models of the visual cortex for its computational advantages and connection to biology. But how the level of sparsity contributes to performance on visual tasks is not well understood. In this work, sparse coding has been integrated into an existing hierarchical V2 model (Hosoya and Hyv\"arinen, 2015), but replacing its independent component analysis (ICA) with an explicit sparse coding in which the degree of sparsity can be controlled. After training, the sparse coding basis functions with a higher degree of sparsity resembled qualitatively different structures, such as curves and corners. The contributions of the models were assessed with image classification tasks, specifically tasks associated with mid-level vision including figure-ground classification, texture classification, and angle prediction between two line stimuli. In addition, the models were assessed in comparison to a texture sensitivity measure that has been reported in V2 (Freeman et al., 2013), and a deleted-region inference task. The results from the experiments show that while sparse coding performed worse than ICA at classifying images, only sparse coding was able to better match the texture sensitivity level of V2 and infer deleted image regions, both by increasing the degree of sparsity in sparse coding. Higher degrees of sparsity allowed for inference over larger deleted image regions. The mechanism that allows for this inference capability in sparse coding is described here.
    Taylor-Lagrange Neural Ordinary Differential Equations: Toward Fast Training and Evaluation of Neural ODEs. (arXiv:2201.05715v1 [cs.LG])
    (2 min) Neural ordinary differential equations (NODEs) -- parametrizations of differential equations using neural networks -- have shown tremendous promise in learning models of unknown continuous-time dynamical systems from data. However, every forward evaluation of a NODE requires numerical integration of the neural network used to capture the system dynamics, making their training prohibitively expensive. Existing works rely on off-the-shelf adaptive step-size numerical integration schemes, which often require an excessive number of evaluations of the underlying dynamics network to obtain sufficient accuracy for training. By contrast, we accelerate the evaluation and the training of NODEs by proposing a data-driven approach to their numerical integration. The proposed Taylor-Lagrange NODEs (TL-NODEs) use a fixed-order Taylor expansion for numerical integration, while also learning to estimate the expansion's approximation error. As a result, the proposed approach achieves the same accuracy as adaptive step-size schemes while employing only low-order Taylor expansions, thus greatly reducing the computational cost necessary to integrate the NODE. A suite of numerical experiments, including modeling dynamical systems, image classification, and density estimation, demonstrate that TL-NODEs can be trained more than an order of magnitude faster than state-of-the-art approaches, without any loss in performance.
    Intrusion Prevention through Optimal Stopping. (arXiv:2111.00289v2 [cs.LG] UPDATED)
    (2 min) We study automated intrusion prevention using reinforcement learning. Following a novel approach, we formulate the problem of intrusion prevention as an (optimal) multiple stopping problem. This formulation gives us insight into the structure of optimal policies, which we show to have threshold properties. For most practical cases, it is not feasible to obtain an optimal defender policy using dynamic programming. We therefore develop a reinforcement learning approach to approximate an optimal policy. Our method for learning and validating policies includes two systems: a simulation system where defender policies are incrementally learned and an emulation system where statistics are produced that drive simulation runs and where learned policies are evaluated. We show that our approach can produce effective defender policies for a practical IT infrastructure of limited size. Inspection of the learned policies confirms that they exhibit threshold properties.
    Autoencoders for Semivisible Jet Detection. (arXiv:2112.02864v3 [hep-ph] UPDATED)
    (2 min) The production of dark matter particles from confining dark sectors may lead to many novel experimental signatures. Depending on the details of the theory, dark quark production in proton-proton collisions could result in semivisible jets of particles: collimated sprays of dark hadrons of which only some are detectable by particle collider experiments. The experimental signature is characterised by the presence of reconstructed missing momentum collinear with the visible components of the jets. This complex topology is sensitive to detector inefficiencies and mis-reconstruction that generate artificial missing momentum. With this work, we propose a signal-agnostic strategy to reject ordinary jets and identify semivisible jets via anomaly detection techniques. A deep neural autoencoder network with jet substructure variables as input proves highly useful for analyzing anomalous jets. The study focuses on the semivisible jet signature; however, the technique can apply to any new physics model that predicts signatures with anomalous jets from non-SM particles.
    Molecular Contrastive Learning of Representations via Graph Neural Networks. (arXiv:2102.10056v2 [cs.LG] UPDATED)
    (2 min) Molecular Machine Learning (ML) bears promise for efficient molecule property prediction and drug discovery. However, labeled molecule data can be expensive and time-consuming to acquire. Due to the limited labeled data, it is a great challenge for supervised-learning ML models to generalize to the giant chemical space. In this work, we present MolCLR: Molecular Contrastive Learning of Representations via Graph Neural Networks (GNNs), a self-supervised learning framework that leverages large unlabeled data (~10M unique molecules). In MolCLR pre-training, we build molecule graphs and develop GNN encoders to learn differentiable representations. Three molecule graph augmentations are proposed: atom masking, bond deletion, and subgraph removal. A contrastive estimator maximizes the agreement of augmentations from the same molecule while minimizing the agreement of different molecules. Experiments show that our contrastive learning framework significantly improves the performance of GNNs on various molecular property benchmarks including both classification and regression tasks. Benefiting from pre-training on the large unlabeled database, MolCLR even achieves state-of-the-art on several challenging benchmarks after fine-tuning. Additionally, further investigations demonstrate that MolCLR learns to embed molecules into representations that can distinguish chemically reasonable molecular similarities.
    Missing Data Estimation in Temporal Multilayer Position-aware Graph Neural Network (TMP-GNN). (arXiv:2108.03400v2 [cs.LG] UPDATED)
    (2 min) GNNs have been proven to perform highly effective in various node-level, edge-level, and graph-level prediction tasks in several domains. Existing approaches mainly focus on static graphs. However, many graphs change over time with their edge may disappear, or node/edge attribute may alter from one time to the other. It is essential to consider such evolution in representation learning of nodes in time varying graphs. In this paper, we propose a Temporal Multi-layered Position-aware Graph Neural Network (TMP-GNN), a node embedding approach for dynamic graph that incorporates the interdependence of temporal relations into embedding computation. We evaluate the performance of TMP-GNN on two different representations of temporal multilayered graphs. The performance is assessed against the most popular GNNs on node-level prediction tasks. Then, we incorporate TMP-GNN into a deep learning framework to estimate missing data and compare the performance with their corresponding competent GNNs from our former experiment, and a baseline method. Experimental results on four real-world datasets yield up to 58% of lower ROC AUC for pairwise node classification task, and 96% of lower MAE in missing feature estimation, particularly for graphs with a relatively high number of nodes and lower mean degree of connectivity.
    Inverting Adversarially Robust Networks for Image Synthesis. (arXiv:2106.06927v2 [cs.CV] UPDATED)
    (2 min) Despite unconditional feature inverters being the foundation of many synthesis tasks, training them requires a large computational overhead, decoding capacity or additional autoregressive priors. We propose to train an adversarially robust encoder to learn disentangled and perceptually-aligned bottleneck features, making them easily invertible. Then, by training a simple generator with the mirror architecture of the encoder, we achieve superior reconstructions and generalization over standard approaches. We exploit such properties using an encoding-decoding network based on AR features and demonstrate its oustanding performance on three applications: anomaly detection, style transfer and image denoising. Comparisons against alternative learn-based methods show that our model attains improved performance with significantly less training parameters.
    Global Regular Network for Writer Identification. (arXiv:2201.05951v1 [cs.CV])
    (0 min) Writer identification has practical applications for forgery detection and forensic science. Most models based on deep neural networks extract features from character image or sub-regions in character image, which ignoring features contained in page-region image. Our proposed global regular network (GRN) pays attention to these features. GRN network consists of two branches: one branch takes page handwriting as input to extract global features, and the other takes word handwriting as input to extract local features. Global features and local features merge in a global residual way to form overall features of the handwriting. The proposed GRN has two attributions: one is adding a branch to extract features contained in page; the other is using residual attention network to extract local feature. Experiments demonstrate the effectiveness of both strategies. On CVL dataset, our models achieve impressive 99.98% top-1 accuracy and 100% top-5 accuracy with shorter training time and fewer network parameters, which exceeded the state-of-the-art structure. The experiment shows the powerful ability of the network in the field of writer identification. The source code is available at https://github.com/wangshiyu001/GRN.
    Light Field Networks: Neural Scene Representations with Single-Evaluation Rendering. (arXiv:2106.02634v2 [cs.CV] UPDATED)
    (0 min) Inferring representations of 3D scenes from 2D observations is a fundamental problem of computer graphics, computer vision, and artificial intelligence. Emerging 3D-structured neural scene representations are a promising approach to 3D scene understanding. In this work, we propose a novel neural scene representation, Light Field Networks or LFNs, which represent both geometry and appearance of the underlying 3D scene in a 360-degree, four-dimensional light field parameterized via a neural implicit representation. Rendering a ray from an LFN requires only a single network evaluation, as opposed to hundreds of evaluations per ray for ray-marching or volumetric based renderers in 3D-structured neural scene representations. In the setting of simple scenes, we leverage meta-learning to learn a prior over LFNs that enables multi-view consistent light field reconstruction from as little as a single image observation. This results in dramatic reductions in time and memory complexity, and enables real-time rendering. The cost of storing a 360-degree light field via an LFN is two orders of magnitude lower than conventional methods such as the Lumigraph. Utilizing the analytical differentiability of neural implicit representations and a novel parameterization of light space, we further demonstrate the extraction of sparse depth maps from LFNs.
    Weighting and Pruning based Ensemble Deep Random Vector Functional Link Network for Tabular Data Classification. (arXiv:2201.05809v1 [cs.LG])
    (0 min) In this paper, we first introduce batch normalization to the edRVFL network. This re-normalization method can help the network avoid divergence of the hidden features. Then we propose novel variants of Ensemble Deep Random Vector Functional Link (edRVFL). Weighted edRVFL (WedRVFL) uses weighting methods to give training samples different weights in different layers according to how the samples were classified confidently in the previous layer thereby increasing the ensemble's diversity and accuracy. Furthermore, a pruning-based edRVFL (PedRVFL) has also been proposed. We prune some inferior neurons based on their importance for classification before generating the next hidden layer. Through this method, we ensure that the randomly generated inferior features will not propagate to deeper layers. Subsequently, the combination of weighting and pruning, called Weighting and Pruning based Ensemble Deep Random Vector Functional Link Network (WPedRVFL), is proposed. We compare their performances with other state-of-the-art deep feedforward neural networks (FNNs) on 24 tabular UCI classification datasets. The experimental results illustrate the superior performance of our proposed methods.
    Transferability in Deep Learning: A Survey. (arXiv:2201.05867v1 [cs.LG])
    (0 min) The success of deep learning algorithms generally depends on large-scale data, while humans appear to have inherent ability of knowledge transfer, by recognizing and applying relevant knowledge from previous learning experiences when encountering and solving unseen tasks. Such an ability to acquire and reuse knowledge is known as transferability in deep learning. It has formed the long-term quest towards making deep learning as data-efficient as human learning, and has been motivating fruitful design of more powerful deep learning algorithms. We present this survey to connect different isolated areas in deep learning with their relation to transferability, and to provide a unified and complete view to investigating transferability through the whole lifecycle of deep learning. The survey elaborates the fundamental goals and challenges in parallel with the core principles and methods, covering recent cornerstones in deep architectures, pre-training, task adaptation and domain adaptation. This highlights unanswered questions on the appropriate objectives for learning transferable knowledge and for adapting the knowledge to new tasks and domains, avoiding catastrophic forgetting and negative transfer. Finally, we implement a benchmark and an open-source library, enabling a fair evaluation of deep learning methods in terms of transferability.
    Learning Graph Structures with Transformer for Multivariate Time Series Anomaly Detection in IoT. (arXiv:2104.03466v3 [cs.LG] UPDATED)
    (0 min) Many real-world IoT systems, which include a variety of internet-connected sensory devices, produce substantial amounts of multivariate time series data. Meanwhile, vital IoT infrastructures like smart power grids and water distribution networks are frequently targeted by cyber-attacks, making anomaly detection an important study topic. Modeling such relatedness is, nevertheless, unavoidable for any efficient and effective anomaly detection system, given the intricate topological and nonlinear connections that are originally unknown among sensors. Furthermore, detecting anomalies in multivariate time series is difficult due to their temporal dependency and stochasticity. This paper presented GTA, a new framework for multivariate time series anomaly detection that involves automatically learning a graph structure, graph convolution, and modeling temporal dependency using a Transformer-based architecture. The connection learning policy, which is based on the Gumbel-softmax sampling approach to learn bi-directed links among sensors directly, is at the heart of learning graph structure. To describe the anomaly information flow between network nodes, we introduced a new graph convolution called Influence Propagation convolution. In addition, to tackle the quadratic complexity barrier, we suggested a multi-branch attention mechanism to replace the original multi-head self-attention method. Extensive experiments on four publicly available anomaly detection benchmarks further demonstrate the superiority of our approach over alternative state-of-the-arts. Codes are available at https://github.com/ZEKAICHEN/GTA.
    Improved Reinforcement Learning in Cooperative Multi-agent Environments Using Knowledge Transfer. (arXiv:2107.09807v5 [cs.MA] UPDATED)
    (0 min) Nowadays, cooperative multi-agent systems are used to learn how to achieve goals in large-scale dynamic environments. However, learning in these environments is challenging: from the effect of search space size on learning time to inefficient cooperation among agents. Moreover, reinforcement learning algorithms may suffer from a long time of convergence in such environments. In this paper, a communication framework is introduced. In the proposed communication framework, agents learn to cooperate effectively and also by introduction of a new state calculation method the size of state space will decline considerably. Furthermore, a knowledge-transferring algorithm is presented to share the gained experiences among the different agents, and develop an effective knowledge-fusing mechanism to fuse the knowledge learnt utilizing the agents' own experiences with the knowledge received from other team members. Finally, the simulation results are provided to indicate the efficacy of the proposed method in the complex learning task. We have evaluated our approach on the shepherding problem and the results show that the learning process accelerates by making use of the knowledge transferring mechanism and the size of state space has declined by generating similar states based on state abstraction concept.
    UDC: Unified DNAS for Compressible TinyML Models. (arXiv:2201.05842v1 [cs.LG])
    (0 min) Emerging Internet-of-things (IoT) applications are driving deployment of neural networks (NNs) on heavily constrained low-cost hardware (HW) platforms, where accuracy is typically limited by memory capacity. To address this TinyML challenge, new HW platforms like neural processing units (NPUs) have support for model compression, which exploits aggressive network quantization and unstructured pruning optimizations. The combination of NPUs with HW compression and compressible models allows more expressive models in the same memory footprint. However, adding optimizations for compressibility on top of conventional NN architecture choices expands the design space across which we must make balanced trade-offs. This work bridges the gap between NPU HW capability and NN model design, by proposing a neural arcthiecture search (NAS) algorithm to efficiently search a large design space, including: network depth, operator type, layer width, bitwidth, sparsity, and more. Building on differentiable NAS (DNAS) with several key improvements, we demonstrate Unified DNAS for Compressible models (UDC) on CIFAR100, ImageNet, and DIV2K super resolution tasks. On ImageNet, we find Pareto dominant compressible models, which are 1.9x smaller or 5.76% more accurate.
    Robust uncertainty estimates with out-of-distribution pseudo-inputs training. (arXiv:2201.05890v1 [cs.LG])
    (0 min) Probabilistic models often use neural networks to control their predictive uncertainty. However, when making out-of-distribution (OOD)} predictions, the often-uncontrollable extrapolation properties of neural networks yield poor uncertainty predictions. Such models then don't know what they don't know, which directly limits their robustness w.r.t unexpected inputs. To counter this, we propose to explicitly train the uncertainty predictor where we are not given data to make it reliable. As one cannot train without data, we provide mechanisms for generating pseudo-inputs in informative low-density regions of the input space, and show how to leverage these in a practical Bayesian framework that casts a prior distribution over the model uncertainty. With a holistic evaluation, we demonstrate that this yields robust and interpretable predictions of uncertainty while retaining state-of-the-art performance on diverse tasks such as regression and generative modelling
    Transformers in Action:Weakly Supervised Action Segmentation. (arXiv:2201.05675v1 [cs.CV])
    (0 min) The video action segmentation task is regularly explored under weaker forms of supervision, such as transcript supervision, where a list of actions is easier to obtain than dense frame-wise labels. In this formulation, the task presents various challenges for sequence modeling approaches due to the emphasis on action transition points, long sequence lengths, and frame contextualization, making the task well-posed for transformers. Given developments enabling transformers to scale linearly, we demonstrate through our architecture how they can be applied to improve action alignment accuracy over the equivalent RNN-based models with the attention mechanism focusing around salient action transition regions. Additionally, given the recent focus on inference-time transcript selection, we propose a supplemental transcript embedding approach to select transcripts more quickly at inference-time. Furthermore, we subsequently demonstrate how this approach can also improve the overall segmentation performance. Finally, we evaluate our proposed methods across the benchmark datasets to better understand the applicability of transformers and the importance of transcript selection on this video-driven weakly-supervised task.
    Moses: Efficient Exploitation of Cross-device Transferable Features for Tensor Program Optimization. (arXiv:2201.05752v1 [cs.LG])
    (0 min) Achieving efficient execution of machine learning models has attracted significant attention recently. To generate tensor programs efficiently, a key component of DNN compilers is the cost model that can predict the performance of each configuration on specific devices. However, due to the rapid emergence of hardware platforms, it is increasingly labor-intensive to train domain-specific predictors for every new platform. Besides, current design of cost models cannot provide transferable features between different hardware accelerators efficiently and effectively. In this paper, we propose Moses, a simple and efficient design based on the lottery ticket hypothesis, which fully takes advantage of the features transferable to the target device via domain adaptation. Compared with state-of-the-art approaches, Moses achieves up to 1.53X efficiency gain in the search stage and 1.41X inference speedup on challenging DNN benchmarks.
    Training Fair Deep Neural Networks by Balancing Influence. (arXiv:2201.05759v1 [cs.LG])
    (0 min) Most fair machine learning methods either highly rely on the sensitive information of the training samples or require a large modification on the target models, which hinders their practical application. To address this issue, we propose a two-stage training algorithm named FAIRIF. It minimizes the loss over the reweighted data set (second stage) where the sample weights are computed to balance the model performance across different demographic groups (first stage). FAIRIF can be applied on a wide range of models trained by stochastic gradient descent without changing the model, while only requiring group annotations on a small validation set to compute sample weights. Theoretically, we show that, in the classification setting, three notions of disparity among different groups can be mitigated by training with the weights. Experiments on synthetic data sets demonstrate that FAIRIF yields models with better fairness-utility trade-offs against various types of bias; and on real-world data sets, we show the effectiveness and scalability of FAIRIF. Moreover, as evidenced by the experiments with pretrained models, FAIRIF is able to alleviate the unfairness issue of pretrained models without hurting their performance.
    Imputing Missing Observations with Time Sliced Synthetic Minority Oversampling Technique. (arXiv:2201.05634v1 [cs.LG])
    (0 min) We present a simple yet novel time series imputation technique with the goal of constructing an irregular time series that is uniform across every sample in a data set. Specifically, we fix a grid defined by the midpoints of non-overlapping bins (dubbed "slices") of observation times and ensure that each sample has values for all of the features at that given time. This allows one to both impute fully missing observations to allow uniform time series classification across the entire data and, in special cases, to impute individually missing features. To do so, we slightly generalize the well-known class imbalance algorithm SMOTE \cite{smote} to allow component wise nearest neighbor interpolation that preserves correlations when there are no missing features. We visualize the method in the simplified setting of 2-dimensional uncoupled harmonic oscillators. Next, we use tSMOTE to train an Encoder/Decoder long-short term memory (LSTM) model with Logistic Regression for predicting and classifying distinct trajectories of different 2D oscillators. After illustrating the the utility of tSMOTE in this context, we use the same architecture to train a clinical model for COVID-19 disease severity on an imputed data set. Our experiments show an improvement over standard mean and median imputation techniques by allowing a wider class of patient trajectories to be recognized by the model, as well as improvement over aggregated classification models.
    Discrete Simulation Optimization for Tuning Machine Learning Method Hyperparameters. (arXiv:2201.05978v1 [cs.LG])
    (0 min) Machine learning methods are being increasingly used in most technical areas such as image recognition, product recommendation, financial analysis, medical diagnosis, and predictive maintenance. The key question that arises is: how do we control the learning process according to our requirement for the problem? Hyperparameter tuning is used to choose the optimal set of hyperparameters for controlling the learning process of a model. Selecting the appropriate hyperparameters directly impacts the performance measure a model. We have used simulation optimization using discrete search methods like ranking and selection (R&S) methods such as the KN method and stochastic ruler method and its variations for hyperparameter optimization and also developed the theoretical basis for applying common R&S methods. The KN method finds the best possible system with statistical guarantee and stochastic ruler method asymptotically converges to the optimal solution and is also computationally very efficient. We also benchmarked our results with state of art hyperparameter optimization libraries such as $hyperopt$ and $mango$ and found KN and stochastic ruler to be performing consistently better than $hyperopt~rand$ and stochastic ruler to be equally efficient in comparison with $hyperopt~tpe$ in most cases, even when our computational implementations are not yet optimized in comparison to professional packages.
    Convolutional Signature for Sequential Data. (arXiv:2009.06719v2 [stat.ML] UPDATED)
    (0 min) Signature is an infinite graded sequence of statistics known to characterize geometric rough paths, which includes the paths with bounded variation. This object has been studied successfully for machine learning with mostly applications in low dimensional cases. In the high dimensional case, it suffers from exponential growth in the number of features in truncated signature transform. We propose a novel neural network based model which borrows the idea from Convolutional Neural Network to address this problem. Our model reduces the number of features efficiently in a data dependent way. Some empirical experiments are provided to support our model.
    Deep Optimal Transport on SPD Manifolds for Domain Adaptation. (arXiv:2201.05745v1 [cs.LG])
    (0 min) The domain adaption (DA) problem on symmetric positive definite (SPD) manifolds has raised interest in the machine learning community because of the growing potential for the SPD-matrix representations across many non-stationary applicable scenarios. This paper generalizes the joint distribution adaption (JDA) to align the source and target domains on SPD manifolds and proposes a deep network architecture, Deep Optimal Transport (DOT), using the generalized JDA and the existing deep network architectures on SPD manifolds. The specific architecture in DOT enables it to learn an approximate optimal transport (OT) solution to the DA problems on SPD manifolds. In the experiments, DOT exhibits a 2.32% and 2.92% increase on the average accuracy in two highly non-stationary cross-session scenarios in brain-computer interfaces (BCIs), respectively. The visualizational results of the source and target domains before and after the transformation also demonstrate the validity of DOT.
    Tools and Practices for Responsible AI Engineering. (arXiv:2201.05647v1 [cs.LG])
    (0 min) Responsible Artificial Intelligence (AI) - the practice of developing, evaluating, and maintaining accurate AI systems that also exhibit essential properties such as robustness and explainability - represents a multifaceted challenge that often stretches standard machine learning tooling, frameworks, and testing methods beyond their limits. In this paper, we present two new software libraries - hydra-zen and the rAI-toolbox - that address critical needs for responsible AI engineering. hydra-zen dramatically simplifies the process of making complex AI applications configurable, and their behaviors reproducible. The rAI-toolbox is designed to enable methods for evaluating and enhancing the robustness of AI-models in a way that is scalable and that composes naturally with other popular ML frameworks. We describe the design principles and methodologies that make these tools effective, including the use of property-based testing to bolster the reliability of the tools themselves. Finally, we demonstrate the composability and flexibility of the tools by showing how various use cases from adversarial robustness and explainable AI can be concisely implemented with familiar APIs.
    Zero-Shot Machine Unlearning. (arXiv:2201.05629v1 [cs.LG])
    (0 min) With the introduction of new privacy regulations, machine unlearning is becoming an emerging research problem due to an increasing need for regulatory compliance required for machine learning (ML) applications. Modern privacy regulations grant citizens the right to be forgotten by products, services and companies. This necessitates deletion of data not only from storage archives but also from ML model. The right to be forgotten requests come in the form of removal of a certain set or class of data from the already trained ML model. Practical considerations preclude retraining of the model from scratch minus the deleted data. The few existing studies use the whole training data, or a subset of training data, or some metadata stored during training to update the model weights for unlearning. However, strict regulatory compliance requires time-bound deletion of data. Thus, in many cases, no data related to the training process or training samples may be accessible even for the unlearning purpose. We therefore ask the question: is it possible to achieve unlearning with zero training samples? In this paper, we introduce the novel problem of zero-shot machine unlearning that caters for the extreme but practical scenario where zero original data samples are available for use. We then propose two novel solutions for zero-shot machine unlearning based on (a) error minimizing-maximizing noise and (b) gated knowledge transfer. We also introduce a new evaluation metric, Anamnesis Index (AIN) to effectively measure the quality of the unlearning method. The experiments show promising results for unlearning in deep learning models on benchmark vision data-sets. The source code will be made publicly available.
    Smart Parking Space Detection under Hazy conditions using Convolutional Neural Networks: A Novel Approach. (arXiv:2201.05858v1 [cs.CV])
    (0 min) Limited urban parking space combined with urbanization has necessitated the development of smart parking systems that can communicate the availability of parking slots to the end users. Towards this, various deep learning based solutions using convolutional neural networks have been proposed for parking space occupation detection. Though these approaches are robust to partial obstructions and lighting conditions, their performance is found to degrade in the presence of haze conditions. Looking in this direction, this paper investigates the use of dehazing networks that improves the performance of parking space occupancy classifier under hazy conditions. Additionally, training procedures are proposed for dehazing networks to maximize the performance of the system on both hazy and non-hazy conditions. The proposed system is deployable as part of existing smart parking systems where limited number of cameras are used to monitor hundreds of parking spaces. To validate our approach, we have developed a custom hazy parking system dataset from real-world task-driven test set of RESIDE-\b{eta} dataset. The proposed approach is tested against existing state-of-the-art parking space detectors on CNRPark-EXT and hazy parking system datasets. Experimental results indicate that there is a significant accuracy improvement of the proposed approach on the hazy parking system dataset.
    GearNet: Stepwise Dual Learning for Weakly Supervised Domain Adaptation. (arXiv:2201.06001v1 [cs.LG])
    (0 min) This paper studies weakly supervised domain adaptation(WSDA) problem, where we only have access to the source domain with noisy labels, from which we need to transfer useful information to the unlabeled target domain. Although there have been a few studies on this problem, most of them only exploit unidirectional relationships from the source domain to the target domain. In this paper, we propose a universal paradigm called GearNet to exploit bilateral relationships between the two domains. Specifically, we take the two domains as different inputs to train two models alternately, and asymmetrical Kullback-Leibler loss is used for selectively matching the predictions of the two models in the same domain. This interactive learning schema enables implicit label noise canceling and exploits correlations between the source and target domains. Therefore, our GearNet has the great potential to boost the performance of a wide range of existing WSDL methods. Comprehensive experimental results show that the performance of existing methods can be significantly improved by equipping with our GearNet.
    Deep learning of multi-resolution X-Ray micro-CT images for multi-scale modelling. (arXiv:2111.01270v2 [physics.geo-ph] UPDATED)
    (0 min) Field-of-view and resolution trade-offs in X-Ray micro-computed tomography (micro-CT) imaging limit the characterization, analysis and model development of multi-scale porous systems. To this end, we developed an applied methodology utilising deep learning to enhance low resolution images over large sample sizes and create multi-scale models capable of accurately simulating experimental fluid dynamics from the pore (microns) to continuum (centimetres) scale. We develop a 3D Enhanced Deep Super Resolution (EDSR) convolutional neural network to create super resolution (SR) images from low resolution images, which alleviates common micro-CT hardware/reconstruction defects in high-resolution (HR) images. When paired with pore-network simulations and parallel computation, we can create large 3D continuum-scale models with spatially varying flow & material properties. We quantitatively validate the workflow at various scales using direct HR/SR image similarity, pore-scale material/flow simulations and continuum scale multiphase flow experiments (drainage immiscible flow pressures and 3D fluid volume fractions). The SR images and models are comparable to the HR ground truth, and generally accurate to within experimental uncertainty at the continuum scale across a range of flow rates. They are found to be significantly more accurate than their LR counterparts, especially in cases where a wide distribution of pore-sizes are encountered. The applied methodology opens up the possibility to image, model and analyse truly multi-scale heterogeneous systems that are otherwise intractable.
    Variance-Reduced Heterogeneous Federated Learning via Stratified Client Selection. (arXiv:2201.05762v1 [cs.LG])
    (0 min) Client selection strategies are widely adopted to handle the communication-efficient problem in recent studies of Federated Learning (FL). However, due to the large variance of the selected subset's update, prior selection approaches with a limited sampling ratio cannot perform well on convergence and accuracy in heterogeneous FL. To address this problem, in this paper, we propose a novel stratified client selection scheme to reduce the variance for the pursuit of better convergence and higher accuracy. Specifically, to mitigate the impact of heterogeneity, we develop stratification based on clients' local data distribution to derive approximate homogeneous strata for better selection in each stratum. Concentrating on a limited sampling ratio scenario, we next present an optimized sample size allocation scheme by considering the diversity of stratum's variability, with the promise of further variance reduction. Theoretically, we elaborate the explicit relation among different selection schemes with regard to variance, under heterogeneous settings, we demonstrate the effectiveness of our selection scheme. Experimental results confirm that our approach not only allows for better performance relative to state-of-the-art methods but also is compatible with prevalent FL algorithms.
    Hyperplane bounds for neural feature mappings. (arXiv:2201.05799v1 [cs.LG])
    (0 min) Deep learning methods minimise the empirical risk using loss functions such as the cross entropy loss. When minimising the empirical risk, the generalisation of the learnt function still depends on the performance on the training data, the Vapnik-Chervonenkis(VC)-dimension of the function and the number of training examples. Neural networks have a large number of parameters, which correlates with their VC-dimension that is typically large but not infinite, and typically a large number of training instances are needed to effectively train them. In this work, we explore how to optimize feature mappings using neural network with the intention to reduce the effective VC-dimension of the hyperplane found in the space generated by the mapping. An interpretation of the results of this study is that it is possible to define a loss that controls the VC-dimension of the separating hyperplane. We evaluate this approach and observe that the performance when using this method improves when the size of the training set is small.
    Graph Transfer Learning via Adversarial Domain Adaptation with Graph Convolution. (arXiv:1909.01541v2 [cs.LG] UPDATED)
    (0 min) This paper studies the problem of cross-network node classification to overcome the insufficiency of labeled data in a single network. It aims to leverage the label information in a partially labeled source network to assist node classification in a completely unlabeled or partially labeled target network. Existing methods for single network learning cannot solve this problem due to the domain shift across networks. Some multi-network learning methods heavily rely on the existence of cross-network connections, thus are inapplicable for this problem. To tackle this problem, we propose a novel graph transfer learning framework AdaGCN by leveraging the techniques of adversarial domain adaptation and graph convolution. It consists of two components: a semi-supervised learning component and an adversarial domain adaptation component. The former aims to learn class discriminative node representations with given label information of the source and target networks, while the latter contributes to mitigating the distribution divergence between the source and target domains to facilitate knowledge transfer. Extensive empirical evaluations on real-world datasets show that AdaGCN can successfully transfer class information with a low label rate on the source network and a substantial divergence between the source and target domains.
    Cautious Policy Programming: Exploiting KL Regularization in Monotonic Policy Improvement for Reinforcement Learning. (arXiv:2107.05798v3 [cs.LG] UPDATED)
    (0 min) In this paper, we propose cautious policy programming (CPP), a novel value-based reinforcement learning (RL) algorithm that can ensure monotonic policy improvement during learning. Based on the nature of entropy-regularized RL, we derive a new entropy regularization-aware lower bound of policy improvement that only requires estimating the expected policy advantage function. CPP leverages this lower bound as a criterion for adjusting the degree of a policy update for alleviating policy oscillation. Different from similar algorithms that are mostly theory-oriented, we also propose a novel interpolation scheme that makes CPP better scale in high dimensional control problems. We demonstrate that the proposed algorithm can trade o? performance and stability in both didactic classic control problems and challenging high-dimensional Atari games.
    An efficient aggregation method for the symbolic representation of temporal data. (arXiv:2201.05697v1 [cs.LG])
    (0 min) Symbolic representations are a useful tool for the dimension reduction of temporal data, allowing for the efficient storage of and information retrieval from time series. They can also enhance the training of machine learning algorithms on time series data through noise reduction and reduced sensitivity to hyperparameters. The adaptive Brownian bridge-based aggregation (ABBA) method is one such effective and robust symbolic representation, demonstrated to accurately capture important trends and shapes in time series. However, in its current form the method struggles to process very large time series. Here we present a new variant of the ABBA method, called fABBA. This variant utilizes a new aggregation approach tailored to the piecewise representation of time series. By replacing the k-means clustering used in ABBA with a sorting-based aggregation technique, and thereby avoiding repeated sum-of-squares error computations, the computational complexity is significantly reduced. In contrast to the original method, the new approach does not require the number of time series symbols to be specified in advance. Through extensive tests we demonstrate that the new method significantly outperforms ABBA with a considerable reduction in runtime while also outperforming the popular SAX and 1d-SAX representations in terms of reconstruction accuracy. We further demonstrate that fABBA can compress other data types such as images.
    Deep Reinforcement Learning for Shared Autonomous Vehicles (SAV) Fleet Management. (arXiv:2201.05720v1 [cs.LG])
    (0 min) Shared Automated Vehicles (SAVs) Fleets companies are starting pilot projects nationwide. In 2020 in Fairfax Virginia it was announced the first Shared Autonomous Vehicle Fleet pilot project in Virginia. SAVs promise to improve quality of life. However, SAVs will also induce some negative externalities by generating excessive vehicle miles traveled (VMT), which leads to more congestions, energy consumption, and emissions. The excessive VMT are primarily generated via empty relocation process. Reinforcement Learning based algorithms are being researched as a possible solution to solve some of these problems: most notably minimizing waiting time for riders. But no research using Reinforcement Learning has been made about reducing parking space cost nor reducing empty cruising time. This study explores different \textbf{Reinforcement Learning approaches and then decide the best approach to help minimize the rider waiting time, parking cost, and empty travel
    Formula graph self-attention network for representation-domain independent materials discovery. (arXiv:2201.05649v1 [cs.LG])
    (0 min) The success of machine learning (ML) in materials property prediction depends heavily on how the materials are represented for learning. Two dominant families of material descriptors exist, one that encodes crystal structure in the representation and the other that only uses stoichiometric information with the hope of discovering new materials. Graph neural networks (GNNs) in particular have excelled in predicting material properties within chemical accuracy. However, current GNNs are limited to only one of the above two avenues owing to the little overlap between respective material representations. Here, we introduce a new concept of formula graph which unifies both stoichiometry-only and structure-based material descriptors. We further develop a self-attention integrated GNN that assimilates a formula graph and show that the proposed architecture produces material embeddings transferable between the two domains. Our model substantially outperforms previous structure-based GNNs as well as structure-agnostic counterparts while exhibiting better sample efficiency and faster convergence. Finally, the model is applied in a challenging exemplar to predict the complex dielectric function of materials and nominate new substances that potentially exhibit epsilon-near-zero phenomena.
    SGLB: Stochastic Gradient Langevin Boosting. (arXiv:2001.07248v5 [cs.LG] UPDATED)
    (0 min) This paper introduces Stochastic Gradient Langevin Boosting (SGLB) - a powerful and efficient machine learning framework that may deal with a wide range of loss functions and has provable generalization guarantees. The method is based on a special form of the Langevin diffusion equation specifically designed for gradient boosting. This allows us to theoretically guarantee the global convergence even for multimodal loss functions, while standard gradient boosting algorithms can guarantee only local optimum. We also empirically show that SGLB outperforms classic gradient boosting when applied to classification tasks with 0-1 loss function, which is known to be multimodal.
    Real-World Graph Convolution Networks (RW-GCNs) for Action Recognition in Smart Video Surveillance. (arXiv:2201.05739v1 [cs.CV])
    (0 min) Action recognition is a key algorithmic part of emerging on-the-edge smart video surveillance and security systems. Skeleton-based action recognition is an attractive approach which, instead of using RGB pixel data, relies on human pose information to classify appropriate actions. However, existing algorithms often assume ideal conditions that are not representative of real-world limitations, such as noisy input, latency requirements, and edge resource constraints. To address the limitations of existing approaches, this paper presents Real-World Graph Convolution Networks (RW-GCNs), an architecture-level solution for meeting the domain constraints of Real World Skeleton-based Action Recognition. Inspired by the presence of feedback connections in the human visual cortex, RW-GCNs leverage attentive feedback augmentation on existing near state-of-the-art (SotA) Spatial-Temporal Graph Convolution Networks (ST-GCNs). The ST-GCNs' design choices are derived from information theory-centric principles to address both the spatial and temporal noise typically encountered in end-to-end real-time and on-the-edge smart video systems. Our results demonstrate RW-GCNs' ability to serve these applications by achieving a new SotA accuracy on the NTU-RGB-D-120 dataset at 94.1%, and achieving 32X less latency than baseline ST-GCN applications while still achieving 90.4% accuracy on the Northwestern UCLA dataset in the presence of spatial keypoint noise. RW-GCNs further show system scalability by running on the 10X cost effective NVIDIA Jetson Nano (as opposed to NVIDIA Xavier NX), while still maintaining a respectful range of throughput (15.6 to 5.5 Actions per Second) on the resource constrained device. The code is available here: https://github.com/TeCSAR-UNCC/RW-GCN.
    Recursive Least Squares Advantage Actor-Critic Algorithms. (arXiv:2201.05918v1 [cs.LG])
    (0 min) As an important algorithm in deep reinforcement learning, advantage actor critic (A2C) has been widely succeeded in both discrete and continuous control tasks with raw pixel inputs, but its sample efficiency still needs to improve more. In traditional reinforcement learning, actor-critic algorithms generally use the recursive least squares (RLS) technology to update the parameter of linear function approximators for accelerating their convergence speed. However, A2C algorithms seldom use this technology to train deep neural networks (DNNs) for improving their sample efficiency. In this paper, we propose two novel RLS-based A2C algorithms and investigate their performance. Both proposed algorithms, called RLSSA2C and RLSNA2C, use the RLS method to train the critic network and the hidden layers of the actor network. The main difference between them is at the policy learning step. RLSSA2C uses an ordinary first-order gradient descent algorithm and the standard policy gradient to learn the policy parameter. RLSNA2C uses the Kronecker-factored approximation, the RLS method and the natural policy gradient to learn the compatible parameter and the policy parameter. In addition, we analyze the complexity and convergence of both algorithms, and present three tricks for further improving their convergence speed. Finally, we demonstrate the effectiveness of both algorithms on 40 games in the Atari 2600 environment and 11 tasks in the MuJoCo environment. From the experimental results, it is shown that our both algorithms have better sample efficiency than the vanilla A2C on most games or tasks, and have higher computational efficiency than other two state-of-the-art algorithms.
    Robust Learning-based Predictive Control for Discrete-time Nonlinear Systems with Unknown Dynamics and State Constraints. (arXiv:1911.09827v4 [eess.SY] UPDATED)
    (0 min) Robust model predictive control (MPC) is a well-known control technique for model-based control with constraints and uncertainties. In classic robust tube-based MPC approaches, an open-loop control sequence is computed via periodically solving an online nominal MPC problem, which requires prior model information and frequent access to onboard computational resources. In this paper, we propose an efficient robust MPC solution based on receding horizon reinforcement learning, called r-LPC, for unknown nonlinear systems with state constraints and disturbances. The proposed r-LPC utilizes a Koopman operator-based prediction model obtained off-line from pre-collected input-output datasets. Unlike classic tube-based MPC, in each prediction time interval of r-LPC, we use an actor-critic structure to learn a near-optimal feedback control policy rather than a control sequence. The resulting closed-loop control policy can be learned off-line and deployed online or learned online in an asynchronous way. In the latter case, online learning can be activated whenever necessary; for instance, the safety constraint is violated with the deployed policy. The closed-loop recursive feasibility, robustness, and asymptotic stability are proven under function approximation errors of the actor-critic networks. Simulation and experimental results on two nonlinear systems with unknown dynamics and disturbances have demonstrated that our approach has better or comparable performance when compared with tube-based MPC and LQR, and outperforms a recently developed actor-critic learning approach.
    ATOM3D: Tasks On Molecules in Three Dimensions. (arXiv:2012.04035v4 [cs.LG] UPDATED)
    (0 min) Computational methods that operate on three-dimensional molecular structure have the potential to solve important questions in biology and chemistry. In particular, deep neural networks have gained significant attention, but their widespread adoption in the biomolecular domain has been limited by a lack of either systematic performance benchmarks or a unified toolkit for interacting with molecular data. To address this, we present ATOM3D, a collection of both novel and existing benchmark datasets spanning several key classes of biomolecules. We implement several classes of three-dimensional molecular learning methods for each of these tasks and show that they consistently improve performance relative to methods based on one- and two-dimensional representations. The specific choice of architecture proves to be critical for performance, with three-dimensional convolutional networks excelling at tasks involving complex geometries, graph networks performing well on systems requiring detailed positional information, and the more recently developed equivariant networks showing significant promise. Our results indicate that many molecular problems stand to gain from three-dimensional molecular learning, and that there is potential for improvement on many tasks which remain underexplored. To lower the barrier to entry and facilitate further developments in the field, we also provide a comprehensive suite of tools for dataset processing, model training, and evaluation in our open-source atom3d Python package. All datasets are available for download from https://www.atom3d.ai .
    Exposing Query Identification for Search Transparency. (arXiv:2110.07701v2 [cs.IR] UPDATED)
    (0 min) Search systems control the exposure of ranked content to searchers. In many cases, creators value not only the exposure of their content but, moreover, an understanding of the specific searches where the content is surfaced. The problem of identifying which queries expose a given piece of content in the ranking results is an important and relatively under-explored search transparency challenge. Exposing queries are useful for quantifying various issues of search bias, privacy, data protection, security, and search engine optimization. Exact identification of exposing queries in a given system is computationally expensive, especially in dynamic contexts such as web search. We explore the feasibility of approximate exposing query identification (EQI) as a retrieval task by reversing the role of queries and documents in two classes of search systems: dense dual-encoder models and traditional BM25 models. We then propose how this approach can be improved through metric learning over the retrieval embedding space. We further derive an evaluation metric to measure the quality of a ranking of exposing queries, as well as conducting an empirical analysis focusing on various practical aspects of approximate EQI. Overall, our work contributes a novel conception of transparency in search systems and computational means of achieving it.
    HeterPS: Distributed Deep Learning With Reinforcement Learning Based Scheduling in Heterogeneous Environments. (arXiv:2111.10635v2 [cs.DC] UPDATED)
    (0 min) Deep neural networks (DNNs) exploit many layers and a large number of parameters to achieve excellent performance. The training process of DNN models generally handles large-scale input data with many sparse features, which incurs high Input/Output (IO) cost, while some layers are compute-intensive. The training process generally exploits distributed computing resources to reduce training time. In addition, heterogeneous computing resources, e.g., CPUs, GPUs of multiple types, are available for the distributed training process. Thus, the scheduling of multiple layers to diverse computing resources is critical for the training process. To efficiently train a DNN model using the heterogeneous computing resources, we propose a distributed framework, i.e., Paddle-Heterogeneous Parameter Server (Paddle-HeterPS), composed of a distributed architecture and a Reinforcement Learning (RL)-based scheduling method. The advantages of Paddle-HeterPS are three-fold compared with existing frameworks. First, Paddle-HeterPS enables efficient training process of diverse workloads with heterogeneous computing resources. Second, Paddle-HeterPS exploits an RL-based method to efficiently schedule the workload of each layer to appropriate computing resources to minimize the cost while satisfying throughput constraints. Third, Paddle-HeterPS manages data storage and data communication among distributed computing resources. We carry out extensive experiments to show that Paddle-HeterPS significantly outperforms state-of-the-art approaches in terms of throughput (14.5 times higher) and monetary cost (312.3% smaller). The codes of the framework are publicly available at: https://github.com/PaddlePaddle/Paddle.
    Differentially Private (Gradient) Expectation Maximization Algorithm with Statistical Guarantees. (arXiv:2010.13520v3 [cs.LG] UPDATED)
    (0 min) (Gradient) Expectation Maximization (EM) is a widely used algorithm for estimating the maximum likelihood of mixture models or incomplete data problems. A major challenge facing this popular technique is how to effectively preserve the privacy of sensitive data. Previous research on this problem has already lead to the discovery of some Differentially Private (DP) algorithms for (Gradient) EM. However, unlike in the non-private case, existing techniques are not yet able to provide finite sample statistical guarantees. To address this issue, we propose in this paper the first DP version of (Gradient) EM algorithm with statistical guarantees. Moreover, we apply our general framework to three canonical models: Gaussian Mixture Model (GMM), Mixture of Regressions Model (MRM) and Linear Regression with Missing Covariates (RMC). Specifically, for GMM in the DP model, our estimation error is near optimal in some cases. For the other two models, we provide the first finite sample statistical guarantees. Our theory is supported by thorough numerical experiments.
    Dataset Distillation with Infinitely Wide Convolutional Networks. (arXiv:2107.13034v3 [cs.LG] UPDATED)
    (0 min) The effectiveness of machine learning algorithms arises from being able to extract useful features from large amounts of data. As model and dataset sizes increase, dataset distillation methods that compress large datasets into significantly smaller yet highly performant ones will become valuable in terms of training efficiency and useful feature extraction. To that end, we apply a novel distributed kernel based meta-learning framework to achieve state-of-the-art results for dataset distillation using infinitely wide convolutional neural networks. For instance, using only 10 datapoints (0.02% of original dataset), we obtain over 65% test accuracy on CIFAR-10 image classification task, a dramatic improvement over the previous best test accuracy of 40%. Our state-of-the-art results extend across many other settings for MNIST, Fashion-MNIST, CIFAR-10, CIFAR-100, and SVHN. Furthermore, we perform some preliminary analyses of our distilled datasets to shed light on how they differ from naturally occurring data.
    Large-Scale Inventory Optimization: A Recurrent-Neural-Networks-Inspired Simulation Approach. (arXiv:2201.05868v1 [math.OC])
    (0 min) Many large-scale production networks include thousands types of final products and tens to hundreds thousands types of raw materials and intermediate products. These networks face complicated inventory management decisions, which are often too complicated for inventory models and too large for simulation models. In this paper, by combing efficient computational tools of recurrent neural networks (RNN) and the structural information of production networks, we propose a RNN inspired simulation approach that may be thousands times faster than existing simulation approach and is capable of solving large-scale inventory optimization problems in a reasonable amount of time.
    Attention Approximates Sparse Distributed Memory. (arXiv:2111.05498v2 [cs.LG] UPDATED)
    (0 min) While Attention has come to be an important mechanism in deep learning, there remains limited intuition for why it works so well. Here, we show that Transformer Attention can be closely related under certain data conditions to Kanerva's Sparse Distributed Memory (SDM), a biologically plausible associative memory model. We confirm that these conditions are satisfied in pre-trained GPT2 Transformer models. We discuss the implications of the Attention-SDM map and provide new computational and biological interpretations of Attention.
    Self-Supervised GANs with Label Augmentation. (arXiv:2106.08601v5 [cs.LG] UPDATED)
    (0 min) Recently, transformation-based self-supervised learning has been applied to generative adversarial networks (GANs) to mitigate catastrophic forgetting in the discriminator by introducing a stationary learning environment. However, the separate self-supervised tasks in existing self-supervised GANs cause a goal inconsistent with generative modeling due to the fact that their self-supervised classifiers are agnostic to the generator distribution. To address this problem, we propose a novel self-supervised GAN that unifies the GAN task with the self-supervised task by augmenting the GAN labels (real or fake) via self-supervision of data transformation. Specifically, the original discriminator and self-supervised classifier are unified into a label-augmented discriminator that predicts the augmented labels to be aware of both the generator distribution and the data distribution under every transformation, and then provide the discrepancy between them to optimize the generator. Theoretically, we prove that the optimal generator could converge to replicate the real data distribution. Empirically, we show that the proposed method significantly outperforms previous self-supervised and data augmentation GANs on both generative modeling and representation learning across benchmark datasets.
    A Novel Multi-Task Learning Method for Symbolic Music Emotion Recognition. (arXiv:2201.05782v1 [cs.SD])
    (0 min) Symbolic Music Emotion Recognition(SMER) is to predict music emotion from symbolic data, such as MIDI and MusicXML. Previous work mainly focused on learning better representation via (mask) language model pre-training but ignored the intrinsic structure of the music, which is extremely important to the emotional expression of music. In this paper, we present a simple multi-task framework for SMER, which incorporates the emotion recognition task with other emotion-related auxiliary tasks derived from the intrinsic structure of the music. The results show that our multi-task framework can be adapted to different models. Moreover, the labels of auxiliary tasks are easy to be obtained, which means our multi-task methods do not require manually annotated labels other than emotion. Conducting on two publicly available datasets (EMOPIA and VGMIDI), the experiments show that our methods perform better in SMER task. Specifically, accuracy has been increased by 4.17 absolute point to 67.58 in EMOPIA dataset, and 1.97 absolute point to 55.85 in VGMIDI dataset. Ablation studies also show the effectiveness of multi-task methods designed in this paper.
    Incorporating Multiple Cluster Centers for Multi-Label Learning. (arXiv:2004.08113v3 [cs.LG] UPDATED)
    (0 min) Multi-label learning deals with the problem that each instance is associated with multiple labels simultaneously. Most of the existing approaches aim to improve the performance of multi-label learning by exploiting label correlations. Although the data augmentation technique is widely used in many machine learning tasks, it is still unclear whether data augmentation is helpful to multi-label learning. In this article, we propose to leverage the data augmentation technique to improve the performance of multi-label learning. Specifically, we first propose a novel data augmentation approach that performs clustering on the real examples and treats the cluster centers as virtual examples, and these virtual examples naturally embody the local label correlations and label importances. Then, motivated by the cluster assumption that examples in the same cluster should have the same label, we propose a novel regularization term to bridge the gap between the real examples and virtual examples, which can promote the local smoothness of the learning function. Extensive experimental results on a number of real-world multi-label datasets clearly demonstrate that our proposed approach outperforms the state-of-the-art counterparts.
    Enhanced Membership Inference Attacks against Machine Learning Models. (arXiv:2111.09679v2 [cs.LG] UPDATED)
    (0 min) How much does a given trained model leak about each individual data record in its training set? Membership inference attacks are used as an auditing tool to quantify the private information that a model leaks about the individual data points in its training set. Membership inference attacks are influenced by different uncertainties that an attacker has to resolve about training data, the training algorithm, and the underlying data distribution. Thus attack success rates, of many attacks in the literature, do not precisely capture the information leakage of models about their data, as they also reflect other uncertainties that the attack algorithm has. In this paper, we explain the implicit assumptions and also the simplifications made in prior work using the framework of hypothesis testing. We also derive new attack algorithms from the framework that can achieve a high AUC score while also highlighting the different factors that affect their performance. Our algorithms capture a very precise approximation of privacy loss in models, and can be used as a tool to perform an accurate and informed estimation of privacy risk in machine learning models. We provide a thorough empirical evaluation of our attack strategies on various machine learning tasks and benchmark datasets.
    Mixed Diagnostics for Longitudinal Properties of Electron Bunches in a Free-Electron Laser. (arXiv:2201.05769v1 [physics.acc-ph])
    (0 min) Longitudinal properties of electron bunches are critical for the performance of a wide range of scientific facilities. In a free-electron laser, for example, the existing diagnostics only provide very limited longitudinal information of the electron bunch during online tuning and optimization. We leverage the power of artificial intelligence to build a neural network model using experimental data, in order to bring the destructive longitudinal phase space (LPS) diagnostics online virtually and improve the existing current profile online diagnostics which uses a coherent transition radiation (CTR) spectrometer. The model can also serve as a digital twin of the real machine on which algorithms can be tested efficiently and effectively. We demonstrate at the FLASH facility that the encoder-decoder model with more than one decoder can make highly accurate predictions of megapixel LPS images and coherent transition radiation spectra concurrently for electron bunches in a bunch train with broad ranges of LPS shapes and peak currents, which are obtained by scanning all the major control knobs for LPS manipulation. Furthermore, we propose a way to significantly improve the CTR spectrometer online measurement by combining the predicted and measured spectra. Our work showcases how to combine virtual and real diagnostics in order to provide heterogeneous and reliable mixed diagnostics for scientific facilities.
    Reliable Causal Discovery with Improved Exact Search and Weaker Assumptions. (arXiv:2201.05666v1 [cs.LG])
    (0 min) Many of the causal discovery methods rely on the faithfulness assumption to guarantee asymptotic correctness. However, the assumption can be approximately violated in many ways, leading to sub-optimal solutions. Although there is a line of research in Bayesian network structure learning that focuses on weakening the assumption, such as exact search methods with well-defined score functions, they do not scale well to large graphs. In this work, we introduce several strategies to improve the scalability of exact score-based methods in the linear Gaussian setting. In particular, we develop a super-structure estimation method based on the support of inverse covariance matrix which requires assumptions that are strictly weaker than faithfulness, and apply it to restrict the search space of exact search. We also propose a local search strategy that performs exact search on the local clusters formed by each variable and its neighbors within two hops in the super-structure. Numerical experiments validate the efficacy of the proposed procedure, and demonstrate that it scales up to hundreds of nodes with a high accuracy.
    Edge-based Tensor prediction via graph neural networks. (arXiv:2201.05770v1 [cond-mat.mtrl-sci])
    (0 min) Message-passing neural networks (MPNN) have shown extremely high efficiency and accuracy in predicting the physical properties of molecules and crystals, and are expected to become the next-generation material simulation tool after the density functional theory (DFT). However, there is currently a lack of a general MPNN framework for directly predicting the tensor properties of the crystals. In this work, a general framework for the prediction of tensor properties was proposed: the tensor property of a crystal can be decomposed into the average of the tensor contributions of all the atoms in the crystal, and the tensor contribution of each atom can be expanded as the sum of the tensor projections in the directions of the edges connecting the atoms. On this basis, the edge-based expansions of force vectors, Born effective charges (BECs), dielectric (DL) and piezoelectric (PZ) tensors were proposed. These expansions are rotationally equivariant, while the coefficients in these tensor expansions are rotationally invariant scalars which are similar to physical quantities such as formation energy and band gap. The advantage of this tensor prediction framework is that it does not require the network itself to be equivariant. Therefore, in this work, we directly designed the edge-based tensor prediction graph neural network (ETGNN) model on the basis of the invariant graph neural network to predict tensors. The validity and high precision of this tensor prediction framework were shown by the tests of ETGNN on the extended systems, random perturbed structures and JARVIS-DFT datasets. This tensor prediction framework is general for nearly all the GNNs and can achieve higher accuracy with more advanced GNNs in the future.
    Recent Progress in the CUHK Dysarthric Speech Recognition System. (arXiv:2201.05845v1 [eess.AS])
    (0 min) Despite the rapid progress of automatic speech recognition (ASR) technologies in the past few decades, recognition of disordered speech remains a highly challenging task to date. Disordered speech presents a wide spectrum of challenges to current data intensive deep neural networks (DNNs) based ASR technologies that predominantly target normal speech. This paper presents recent research efforts at the Chinese University of Hong Kong (CUHK) to improve the performance of disordered speech recognition systems on the largest publicly available UASpeech dysarthric speech corpus. A set of novel modelling techniques including neural architectural search, data augmentation using spectra-temporal perturbation, model based speaker adaptation and cross-domain generation of visual features within an audio-visual speech recognition (AVSR) system framework were employed to address the above challenges. The combination of these techniques produced the lowest published word error rate (WER) of 25.21% on the UASpeech test set 16 dysarthric speakers, and an overall WER reduction of 5.4% absolute (17.6% relative) over the CUHK 2018 dysarthric speech recognition system featuring a 6-way DNN system combination and cross adaptation of out-of-domain normal speech data trained systems. Bayesian model adaptation further allows rapid adaptation to individual dysarthric speakers to be performed using as little as 3.06 seconds of speech. The efficacy of these techniques were further demonstrated on a CUDYS Cantonese dysarthric speech recognition task.
    Tight Regret Bounds for Noisy Optimization of a Brownian Motion. (arXiv:2001.09327v2 [cs.LG] UPDATED)
    (0 min) We consider the problem of Bayesian optimization of a one-dimensional Brownian motion in which the $T$ adaptively chosen observations are corrupted by Gaussian noise. We show that as the smallest possible expected cumulative regret and the smallest possible expected simple regret scale as $\Omega(\sigma\sqrt{T / \log (T)}) \cap \mathcal{O}(\sigma\sqrt{T} \cdot \log T)$ and $\Omega(\sigma / \sqrt{T \log (T)}) \cap \mathcal{O}(\sigma\log T / \sqrt{T})$ respectively, where $\sigma^2$ is the noise variance. Thus, our upper and lower bounds are tight up to a factor of $\mathcal{O}( (\log T)^{1.5} )$. The upper bound uses an algorithm based on confidence bounds and the Markov property of Brownian motion (among other useful properties), and the lower bound is based on a reduction to binary hypothesis testing.
    Distribution-Free Federated Learning with Conformal Predictions. (arXiv:2110.07661v2 [cs.LG] UPDATED)
    (0 min) Federated learning has attracted considerable interest for collaborative machine learning in healthcare to leverage separate institutional datasets while maintaining patient privacy. However, additional challenges such as poor calibration and lack of interpretability may also hamper widespread deployment of federated models into clinical practice, leading to user distrust or misuse of ML tools in high-stakes clinical decision-making. In this paper, we propose to address these challenges by incorporating an adaptive conformal framework into federated learning to ensure distribution-free prediction sets that provide coverage guarantees. Importantly, these uncertainty estimates can be obtained without requiring any additional modifications to the model. Empirical results on the MedMNIST medical imaging benchmark demonstrate our federated method provides tighter coverage over local conformal predictions on 6 different medical imaging datasets for 2D and 3D multi-class classification tasks. Furthermore, we correlate class entropy with prediction set size to assess task uncertainty.
    GradTail: Learning Long-Tailed Data Using Gradient-based Sample Weighting. (arXiv:2201.05938v1 [cs.LG])
    (0 min) We propose GradTail, an algorithm that uses gradients to improve model performance on the fly in the face of long-tailed training data distributions. Unlike conventional long-tail classifiers which operate on converged - and possibly overfit - models, we demonstrate that an approach based on gradient dot product agreement can isolate long-tailed data early on during model training and improve performance by dynamically picking higher sample weights for that data. We show that such upweighting leads to model improvements for both classification and regression models, the latter of which are relatively unexplored in the long-tail literature, and that the long-tail examples found by gradient alignment are consistent with our semantic expectations.
    Enhancement of Healthcare Data Performance Metrics using Neural Network Machine Learning Algorithms. (arXiv:2201.05962v1 [cs.LG])
    (0 min) Patients are often encouraged to make use of wearable devices for remote collection and monitoring of health data. This adoption of wearables results in a significant increase in the volume of data collected and transmitted. The battery life of the devices is then quickly diminished due to the high processing requirements of the devices. Given the importance attached to medical data, it is imperative that all transmitted data adhere to strict integrity and availability requirements. Reducing the volume of healthcare data for network transmission may improve sensor battery life without compromising accuracy. There is a trade-off between efficiency and accuracy which can be controlled by adjusting the sampling and transmission rates. This paper demonstrates that machine learning can be used to analyse complex health data metrics such as the accuracy and efficiency of data transmission to overcome the trade-off problem. The study uses time series nonlinear autoregressive neural network algorithms to enhance both data metrics by taking fewer samples to transmit. The algorithms were tested with a standard heart rate dataset to compare their accuracy and efficiency. The result showed that the Levenbery-Marquardt algorithm was the best performer with an efficiency of 3.33 and accuracy of 79.17%, which is similar to other algorithms accuracy but demonstrates improved efficiency. This proves that machine learning can improve without sacrificing a metric over the other compared to the existing methods with high efficiency.
    Big Data Application for Network Level Travel Time Prediction. (arXiv:2201.05760v1 [cs.LG])
    (0 min) Travel time is essential in advanced traveler information systems (ATIS). This paper used the big data analytics engines Apache Spark and Apache MXNet for data processing and modeling. The efficiency gain was evaluated by comparing it with popular data science and deep learning frameworks. The hierarchical feature pooling is explored for both between layer and the output layer LSTM (Long-Short-Term-Memory). The designed hierarchical LSTM (hiLSTM) model can consider the dependencies at a different time scale to capture the spatial-temporal correlations from network-level corridor travel time. A self-attention module is then used to connect temporal and spatial features to the fully connected layers, predicting travel time for all corridors instead of a single link/route. Seasonality and autocorrelation were performed to explore the trend of time-varying data. The case study shows that the Hierarchical LSTM with Attention (hiLSTMat) model gives the best result and outperforms baseline models. The California Bay Area corridor travel time dataset covering four-year periods was published from Caltrans Performance Measurement System (PeMS) system.
    Toward Among-Device AI from On-Device AI with Stream Pipelines. (arXiv:2201.06026v1 [cs.LG])
    (0 min) Modern consumer electronic devices often provide intelligence services with deep neural networks. We have started migrating the computing locations of intelligence services from cloud servers (traditional AI systems) to the corresponding devices (on-device AI systems). On-device AI systems generally have the advantages of preserving privacy, removing network latency, and saving cloud costs. With the emergent of on-device AI systems having relatively low computing power, the inconsistent and varying hardware resources and capabilities pose difficulties. Authors' affiliation has started applying a stream pipeline framework, NNStreamer, for on-device AI systems, saving developmental costs and hardware resources and improving performance. We want to expand the types of devices and applications with on-device AI services products of both the affiliation and second/third parties. We also want to make each AI service atomic, re-deployable, and shared among connected devices of arbitrary vendors; we now have yet another requirement introduced as it always has been. The new requirement of "among-device AI" includes connectivity between AI pipelines so that they may share computing resources and hardware capabilities across a wide range of devices regardless of vendors and manufacturers. We propose extensions of the stream pipeline framework, NNStreamer, for on-device AI so that NNStreamer may provide among-device AI capability. This work is a Linux Foundation (LF AI and Data) open source project accepting contributions from the general public.
    IBAC: An Intelligent Dynamic Bandwidth Channel Access Avoiding Outside Warning Range Problem. (arXiv:2201.05727v1 [cs.NI])
    (0 min) IEEE 802.11ax uses the concept of primary and secondary channels, leading to the Dynamic Bandwidth Channel Access (DBCA) mechanism. By applying DBCA, a wireless station can select a wider channel bandwidth, such as 40/80/160 MHz, by applying the channel bonding feature. However, during channel bonding, inappropriate bandwidth selection can cause collisions. Therefore, to avoid collisions, a well-developed media access control (MAC) protocol is crucial to effectively utilize the channel bonding mechanism. In this paper, we address a collision scenario, called Outside Warning Range Problem (OWRP), that may occur during DBCA when a wireless station interferes with another wireless station after channel bonding is performed. Therefore, we propose a MAC layer mechanism, Intelligent Bonding Avoiding Collision (IBAC), that adapts the channel bonding level in DBCA in order to avoid the OWRP. We first design a theoretical model based on Markov chains for DBCA while avoiding the OWRP. Based on this model, we design a Thompson sampling based Bayesian approach to select the best possible channel bonding level intelligently. We analyze the performance of the IBAC through simulations where it is observed that, comparing to other competing mechanisms, the proposed approach can enhance the network performance significantly while avoiding the OWRP.
    Unbiased deep solvers for linear parametric PDEs. (arXiv:1810.05094v4 [q-fin.CP] UPDATED)
    (0 min) We develop several deep learning algorithms for approximating families of parametric PDE solutions. The proposed algorithms approximate solutions together with their gradients, which in the context of mathematical finance means that the derivative prices and hedging strategies are computed simulatenously. Having approximated the gradient of the solution one can combine it with a Monte-Carlo simulation to remove the bias in the deep network approximation of the PDE solution (derivative price). This is achieved by leveraging the Martingale Representation Theorem and combining the Monte Carlo simulation with the neural network. The resulting algorithm is robust with respect to quality of the neural network approximation and consequently can be used as a black-box in case only limited a priori information about the underlying problem is available. We believe this is important as neural network based algorithms often require fair amount of tuning to produce satisfactory results. The methods are empirically shown to work for high-dimensional problems (e.g. 100 dimensions). We provide diagnostics that shed light on appropriate network architectures.
    Neural-Symbolic Integration for Interactive Learning and Conceptual Grounding. (arXiv:2112.11805v2 [cs.AI] UPDATED)
    (0 min) We propose neural-symbolic integration for abstract concept explanation and interactive learning. Neural-symbolic integration and explanation allow users and domain-experts to learn about the data-driven decision making process of large neural models. The models are queried using a symbolic logic language. Interaction with the user then confirms or rejects a revision of the neural model using logic-based constraints that can be distilled into the model architecture. The approach is illustrated using the Logic Tensor Network framework alongside Concept Activation Vectors and applied to a Convolutional Neural Network.
    Theoretical analysis and computation of the sample Frechet mean for sets of large graphs based on spectral information. (arXiv:2201.05923v1 [stat.ML])
    (0 min) To characterize the location (mean, median) of a set of graphs, one needs a notion of centrality that is adapted to metric spaces, since graph sets are not Euclidean spaces. A standard approach is to consider the Frechet mean. In this work, we equip a set of graphs with the pseudometric defined by the norm between the eigenvalues of their respective adjacency matrix. Unlike the edit distance, this pseudometric reveals structural changes at multiple scales, and is well adapted to studying various statistical problems for graph-valued data. We describe an algorithm to compute an approximation to the sample Frechet mean of a set of undirected unweighted graphs with a fixed size using this pseudometric.
    Wrapped Classifier with Dummy Teacher for training physics-based classifier at unlabeled radar data. (arXiv:2201.05735v1 [physics.geo-ph])
    (0 min) In the paper a method for automatic classification of signals received by EKB and MAGW ISTP SB RAS coherent scatter radars (8-20MHz operating frequency) during 2021 is described. The method is suitable for automatic physical interpretation of the resulting classification of the experimental data in realtime. We called this algorithm Wrapped Classifier with Dummy Teacher. The method is trained on unlabeled dataset and is based on training optimal physics-based classification using clusterization results. The approach is close to optimal embedding search, where the embedding is interpreted as a vector of probabilities for soft classification. The approach allows to find optimal classification algorithm, based on physically interpretable parameters of the received data, both obtained during physics-based numerical simulation and measured experimentally. Dummy Teacher clusterer used for labeling unlabeled dataset is gaussian mixture clustering algorithm. For algorithm functioning we extended the parameters obtained by the radar with additional parameters, calculated during simulation of radiowave propagation using ray-tracing and IRI-2012 and IGRF models for ionosphere and Earth's magnetic field correspondingly. For clustering by Dummy Teacher we use the whole dataset of available parameters (measured and simulated ones). For classification by Wrapped Classifier we use only well physically interpreted parameters. As a result we trained the classification network and found 11 well-interpretable classes from physical point of view in the available data. Five other found classes are not interpretable from physical point of view, demonstrating the importance of taking into account radiowave propagation for correct classification.
    The Dark Side of the Language: Pre-trained Transformers in the DarkNet. (arXiv:2201.05613v1 [cs.CL])
    (0 min) Pre-trained Transformers are challenging human performances in many natural language processing tasks. The gigantic datasets used for pre-training seem to be the key for their success on existing tasks. In this paper, we explore how a range of pre-trained natural language understanding models perform on truly novel and unexplored data, provided by classification tasks over a DarkNet corpus. Surprisingly, results show that syntactic and lexical neural networks largely outperform pre-trained Transformers. This seems to suggest that pre-trained Transformers have serious difficulties in adapting to radically novel texts.
    Concise Logarithmic Loss Function for Robust Training of Anomaly Detection Model. (arXiv:2201.05748v1 [cs.LG])
    (0 min) Recently, deep learning-based algorithms are widely adopted due to the advantage of being able to establish anomaly detection models without or with minimal domain knowledge of the task. Instead, to train the artificial neural network more stable, it should be better to define the appropriate neural network structure or the loss function. For the training anomaly detection model, the mean squared error (MSE) function is adopted widely. On the other hand, the novel loss function, logarithmic mean squared error (LMSE), is proposed in this paper to train the neural network more stable. This study covers a variety of comparisons from mathematical comparisons, visualization in the differential domain for backpropagation, loss convergence in the training process, and anomaly detection performance. In an overall view, LMSE is superior to the existing MSE function in terms of strongness of loss convergence, anomaly detection performance. The LMSE function is expected to be applicable for training not only the anomaly detection model but also the general generative neural network.
    Domain Adaptation via Bidirectional Cross-Attention Transformer. (arXiv:2201.05887v1 [cs.CV])
    (0 min) Domain Adaptation (DA) aims to leverage the knowledge learned from a source domain with ample labeled data to a target domain with unlabeled data only. Most existing studies on DA contribute to learning domain-invariant feature representations for both domains by minimizing the domain gap based on convolution-based neural networks. Recently, vision transformers significantly improved performance in multiple vision tasks. Built on vision transformers, in this paper we propose a Bidirectional Cross-Attention Transformer (BCAT) for DA with the aim to improve the performance. In the proposed BCAT, the attention mechanism can extract implicit source and target mix-up feature representations to narrow the domain discrepancy. Specifically, in BCAT, we design a weight-sharing quadruple-branch transformer with a bidirectional cross-attention mechanism to learn domain-invariant feature representations. Extensive experiments demonstrate that the proposed BCAT model achieves superior performance on four benchmark datasets over existing state-of-the-art DA methods that are based on convolutions or transformers.
    Tractable structured natural gradient descent using local parameterizations. (arXiv:2102.07405v10 [stat.ML] UPDATED)
    (0 min) Natural-gradient descent (NGD) on structured parameter spaces (e.g., low-rank covariances) is computationally challenging due to difficult Fisher-matrix computations. We address this issue by using \emph{local-parameter coordinates} to obtain a flexible and efficient NGD method that works well for a wide-variety of structured parameterizations. We show four applications where our method (1) generalizes the exponential natural evolutionary strategy, (2) recovers existing Newton-like algorithms, (3) yields new structured second-order algorithms via matrix groups, and (4) gives new algorithms to learn covariances of Gaussian and Wishart-based distributions. We show results on a range of problems from deep learning, variational inference, and evolution strategies. Our work opens a new direction for scalable structured geometric methods.
    Interpretable and Effective Reinforcement Learning for Attacking against Graph-based Rumor Detection. (arXiv:2201.05819v1 [cs.LG])
    (0 min) Social networks are polluted by rumors, which can be detected by machine learning models. However, the models are fragile and understanding the vulnerabilities is critical to rumor detection. Certain vulnerabilities are due to dependencies on the graphs and suspiciousness ranking and are difficult for end-to-end methods to learn from limited noisy data. With a black-box detector, we design features capturing the dependencies to allow a reinforcement learning to learn an effective and interpretable attack policy based on the detector output. To speed up learning, we devise: (i) a credit assignment method that decomposes delayed rewards to individual attacking steps proportional to their effects; (ii) a time-dependent control variate to reduce variance due to large graphs and many attacking steps. On two social rumor datasets, we demonstrate: (i) the effectiveness of the attacks compared to rule-based attacks and end-to-end approaches; (ii) the usefulness of the proposed credit assignment strategy and control variate; (iii) interpretability of the policy when generating strong attacks.
    Logarithmic Regret in Multisecretary and Online Linear Programming Problems with Continuous Valuations. (arXiv:1912.08917v3 [cs.LG] UPDATED)
    (0 min) I study a general revenue management problem in which $ n $ customers arrive sequentially over $ n $ periods, and you must dynamically decide which to satisfy. Satisfying the period-$ t $ customer yields utility $ u_{t} \in \mathbb{R}_{+} $ and decreases your inventory holdings by $ A_{t} \in \mathbb{R}_{+}^{M} $. The customer vectors, $ (u_{t}, A_{t}')' $, are i.i.d., with $ u_{t} $ drawn from a finite-mean continuous distribution and $ A_{t} $ drawn from a bounded discrete or continuous distribution. I study this system's regret, which is the additional utility you could get if you didn't have to make decisions on the fly. I show that if your initial inventory endowment scales linearly with $ n $ then your expected regret is $ \Theta(\log(n)) $ as $ n \rightarrow \infty $. I provide a simple policy that achieves this $ \Theta(\log(n)) $ regret rate. Finally, I extend this result to Arlotto and Gurich's (2019) multisecretary problem with uniformly distributed secretary valuations.
  • stat.ML updates on arXiv.org

    Graph Transfer Learning via Adversarial Domain Adaptation with Graph Convolution. (arXiv:1909.01541v2 [cs.LG] UPDATED)
    (2 min) This paper studies the problem of cross-network node classification to overcome the insufficiency of labeled data in a single network. It aims to leverage the label information in a partially labeled source network to assist node classification in a completely unlabeled or partially labeled target network. Existing methods for single network learning cannot solve this problem due to the domain shift across networks. Some multi-network learning methods heavily rely on the existence of cross-network connections, thus are inapplicable for this problem. To tackle this problem, we propose a novel graph transfer learning framework AdaGCN by leveraging the techniques of adversarial domain adaptation and graph convolution. It consists of two components: a semi-supervised learning component and an adversarial domain adaptation component. The former aims to learn class discriminative node representations with given label information of the source and target networks, while the latter contributes to mitigating the distribution divergence between the source and target domains to facilitate knowledge transfer. Extensive empirical evaluations on real-world datasets show that AdaGCN can successfully transfer class information with a low label rate on the source network and a substantial divergence between the source and target domains.
    Differentially Private (Gradient) Expectation Maximization Algorithm with Statistical Guarantees. (arXiv:2010.13520v3 [cs.LG] UPDATED)
    (2 min) (Gradient) Expectation Maximization (EM) is a widely used algorithm for estimating the maximum likelihood of mixture models or incomplete data problems. A major challenge facing this popular technique is how to effectively preserve the privacy of sensitive data. Previous research on this problem has already lead to the discovery of some Differentially Private (DP) algorithms for (Gradient) EM. However, unlike in the non-private case, existing techniques are not yet able to provide finite sample statistical guarantees. To address this issue, we propose in this paper the first DP version of (Gradient) EM algorithm with statistical guarantees. Moreover, we apply our general framework to three canonical models: Gaussian Mixture Model (GMM), Mixture of Regressions Model (MRM) and Linear Regression with Missing Covariates (RMC). Specifically, for GMM in the DP model, our estimation error is near optimal in some cases. For the other two models, we provide the first finite sample statistical guarantees. Our theory is supported by thorough numerical experiments.
    Neighborhood Region Smoothing Regularization for Finding Flat Minima In Deep Neural Networks. (arXiv:2201.06064v1 [cs.LG])
    (2 min) Due to diverse architectures in deep neural networks (DNNs) with severe overparameterization, regularization techniques are critical for finding optimal solutions in the huge hypothesis space. In this paper, we propose an effective regularization technique, called Neighborhood Region Smoothing (NRS). NRS leverages the finding that models would benefit from converging to flat minima, and tries to regularize the neighborhood region in weight space to yield approximate outputs. Specifically, gap between outputs of models in the neighborhood region is gauged by a defined metric based on Kullback-Leibler divergence. This metric provides similar insights with the minimum description length principle on interpreting flat minima. By minimizing both this divergence and empirical loss, NRS could explicitly drive the optimizer towards converging to flat minima. We confirm the effectiveness of NRS by performing image classification tasks across a wide range of model architectures on commonly-used datasets such as CIFAR and ImageNet, where generalization ability could be universally improved. Also, we empirically show that the minima found by NRS would have relatively smaller Hessian eigenvalues compared to the conventional method, which is considered as the evidence of flat minima.
    Extended Stochastic Block Models with Application to Criminal Networks. (arXiv:2007.08569v3 [stat.ME] UPDATED)
    (0 min) Reliably learning group structures among nodes in network data is challenging in several applications. We are particularly motivated by studying covert networks that encode relationships among criminals. These data are subject to measurement errors, and exhibit a complex combination of an unknown number of core-periphery, assortative and disassortative structures that may unveil key architectures of the criminal organization. The coexistence of these noisy block patterns limits the reliability of routinely-used community detection algorithms, and requires extensions of model-based solutions to realistically characterize the node partition process, incorporate information from node attributes, and provide improved strategies for estimation and uncertainty quantification. To cover these gaps, we develop a new class of extended stochastic block models (ESBM) that infer groups of nodes having common connectivity patterns via Gibbs-type priors on the partition process. This choice encompasses many realistic priors for criminal networks, covering solutions with fixed, random and infinite number of possible groups, and facilitates the inclusion of node attributes in a principled manner. Among the new alternatives in our class, we focus on the Gnedin process as a realistic prior that allows the number of groups to be finite, random and subject to a reinforcement process coherent with criminal networks. A collapsed Gibbs sampler is proposed for the whole ESBM class, and refined strategies for estimation, prediction, uncertainty quantification and model selection are outlined. The ESBM performance is illustrated in realistic simulations and in an application to an Italian mafia network, where we unveil key complex block structures, mostly hidden from state-of-the-art alternatives.
    Deep learning via LSTM models for COVID-19 infection forecasting in India. (arXiv:2101.11881v2 [cs.LG] UPDATED)
    (0 min) The COVID-19 pandemic continues to have major impact to health and medical infrastructure, economy, and agriculture. Prominent computational and mathematical models have been unreliable due to the complexity of the spread of infections. Moreover, lack of data collection and reporting makes modelling attempts difficult and unreliable. Hence, we need to re-look at the situation with reliable data sources and innovative forecasting models. Deep learning models such as recurrent neural networks are well suited for modelling spatiotemporal sequences. In this paper, we apply recurrent neural networks such as long short term memory (LSTM), bidirectional LSTM, and encoder-decoder LSTM models for multi-step (short-term) COVID-19 infection forecasting. We select Indian states with COVID-19 hotpots and capture the first (2020) and second (2021) wave of infections and provide two months ahead forecast. Our model predicts that the likelihood of another wave of infections in October and November 2021 is low; however, the authorities need to be vigilant given emerging variants of the virus. The accuracy of the predictions motivate the application of the method in other countries and regions. Nevertheless, the challenges in modelling remain due to the reliability of data and difficulties in capturing factors such as population density, logistics, and social aspects such as culture and lifestyle.
    Enhanced Membership Inference Attacks against Machine Learning Models. (arXiv:2111.09679v2 [cs.LG] UPDATED)
    (0 min) How much does a given trained model leak about each individual data record in its training set? Membership inference attacks are used as an auditing tool to quantify the private information that a model leaks about the individual data points in its training set. Membership inference attacks are influenced by different uncertainties that an attacker has to resolve about training data, the training algorithm, and the underlying data distribution. Thus attack success rates, of many attacks in the literature, do not precisely capture the information leakage of models about their data, as they also reflect other uncertainties that the attack algorithm has. In this paper, we explain the implicit assumptions and also the simplifications made in prior work using the framework of hypothesis testing. We also derive new attack algorithms from the framework that can achieve a high AUC score while also highlighting the different factors that affect their performance. Our algorithms capture a very precise approximation of privacy loss in models, and can be used as a tool to perform an accurate and informed estimation of privacy risk in machine learning models. We provide a thorough empirical evaluation of our attack strategies on various machine learning tasks and benchmark datasets.
    Fair Interpretable Learning via Correction Vectors. (arXiv:2201.06343v1 [cs.LG])
    (0 min) Neural network architectures have been extensively employed in the fair representation learning setting, where the objective is to learn a new representation for a given vector which is independent of sensitive information. Various "representation debiasing" techniques have been proposed in the literature. However, as neural networks are inherently opaque, these methods are hard to comprehend, which limits their usefulness. We propose a new framework for fair representation learning which is centered around the learning of "correction vectors", which have the same dimensionality as the given data vectors. The corrections are then simply summed up to the original features, and can therefore be analyzed as an explicit penalty or bonus to each feature. We show experimentally that a fair representation learning problem constrained in such a way does not impact performance.
    Physical Derivatives: Computing policy gradients by physical forward-propagation. (arXiv:2201.05830v1 [cs.RO])
    (0 min) Model-free and model-based reinforcement learning are two ends of a spectrum. Learning a good policy without a dynamic model can be prohibitively expensive. Learning the dynamic model of a system can reduce the cost of learning the policy, but it can also introduce bias if it is not accurate. We propose a middle ground where instead of the transition model, the sensitivity of the trajectories with respect to the perturbation of the parameters is learned. This allows us to predict the local behavior of the physical system around a set of nominal policies without knowing the actual model. We assay our method on a custom-built physical robot in extensive experiments and show the feasibility of the approach in practice. We investigate potential challenges when applying our method to physical systems and propose solutions to each of them.
    Socioeconomic disparities and COVID-19: the causal connections. (arXiv:2201.07026v1 [cs.LG])
    (0 min) The analysis of causation is a challenging task that can be approached in various ways. With the increasing use of machine learning based models in computational socioeconomics, explaining these models while taking causal connections into account is a necessity. In this work, we advocate the use of an explanatory framework from cooperative game theory augmented with $do$ calculus, namely causal Shapley values. Using causal Shapley values, we analyze socioeconomic disparities that have a causal link to the spread of COVID-19 in the USA. We study several phases of the disease spread to show how the causal connections change over time. We perform a causal analysis using random effects models and discuss the correspondence between the two methods to verify our results. We show the distinct advantages a non-linear machine learning models have over linear models when performing a multivariate analysis, especially since the machine learning models can map out non-linear correlations in the data. In addition, the causal Shapley values allow for including the causal structure in the variable importance computed for the machine learning model.
    On the Number of Edges of the Frechet Mean and Median Graphs. (arXiv:2105.14397v4 [math.CO] UPDATED)
    (0 min) The availability of large datasets composed of graphs creates an unprecedented need to invent novel tools in statistical learning for graph-valued random variables. To characterize the average of a sample of graphs, one can compute the sample Frechet mean and median graphs. In this paper, we address the following foundational question: does a mean or median graph inherit the structural properties of the graphs in the sample? An important graph property is the edge density; we establish that edge density is an hereditary property, which can be transmitted from a graph sample to its sample Frechet mean or median graphs, irrespective of the method used to estimate the mean or the median. Because of the prominence of the Frechet mean in graph-valued machine learning, this novel theoretical result has some significant practical consequences.
    A Short Tutorial on The Weisfeiler-Lehman Test And Its Variants. (arXiv:2201.07083v1 [stat.ML])
    (0 min) Graph neural networks are designed to learn functions on graphs. Typically, the relevant target functions are invariant with respect to actions by permutations. Therefore the design of some graph neural network architectures has been inspired by graph-isomorphism algorithms. The classical Weisfeiler-Lehman algorithm (WL) -- a graph-isomorphism test based on color refinement -- became relevant to the study of graph neural networks. The WL test can be generalized to a hierarchy of higher-order tests, known as $k$-WL. This hierarchy has been used to characterize the expressive power of graph neural networks, and to inspire the design of graph neural network architectures. A few variants of the WL hierarchy appear in the literature. The goal of this short note is pedagogical and practical: We explain the differences between the WL and folklore-WL formulations, with pointers to existing discussions in the literature. We illuminate the differences between the formulations by visualizing an example.
    Treatment Effect Risk: Bounds and Inference. (arXiv:2201.05893v1 [stat.ME])
    (0 min) Since the average treatment effect (ATE) measures the change in social welfare, even if positive, there is a risk of negative effect on, say, some 10% of the population. Assessing such risk is difficult, however, because any one individual treatment effect (ITE) is never observed so the 10% worst-affected cannot be identified, while distributional treatment effects only compare the first deciles within each treatment group, which does not correspond to any 10%-subpopulation. In this paper we consider how to nonetheless assess this important risk measure, formalized as the conditional value at risk (CVaR) of the ITE distribution. We leverage the availability of pre-treatment covariates and characterize the tightest-possible upper and lower bounds on ITE-CVaR given by the covariate-conditional average treatment effect (CATE) function. Some bounds can also be interpreted as summarizing a complex CATE function into a single metric and are of interest independently of being a bound. We then proceed to study how to estimate these bounds efficiently from data and construct confidence intervals. This is challenging even in randomized experiments as it requires understanding the distribution of the unknown CATE function, which can be very complex if we use rich covariates so as to best control for heterogeneity. We develop a debiasing method that overcomes this and prove it enjoys favorable statistical properties even when CATE and other nuisances are estimated by black-box machine learning or even inconsistently. Studying a hypothetical change to French job-search counseling services, our bounds and inference demonstrate a small social benefit entails a negative impact on a substantial subpopulation.
    Robust Learning-based Predictive Control for Discrete-time Nonlinear Systems with Unknown Dynamics and State Constraints. (arXiv:1911.09827v4 [eess.SY] UPDATED)
    (0 min) Robust model predictive control (MPC) is a well-known control technique for model-based control with constraints and uncertainties. In classic robust tube-based MPC approaches, an open-loop control sequence is computed via periodically solving an online nominal MPC problem, which requires prior model information and frequent access to onboard computational resources. In this paper, we propose an efficient robust MPC solution based on receding horizon reinforcement learning, called r-LPC, for unknown nonlinear systems with state constraints and disturbances. The proposed r-LPC utilizes a Koopman operator-based prediction model obtained off-line from pre-collected input-output datasets. Unlike classic tube-based MPC, in each prediction time interval of r-LPC, we use an actor-critic structure to learn a near-optimal feedback control policy rather than a control sequence. The resulting closed-loop control policy can be learned off-line and deployed online or learned online in an asynchronous way. In the latter case, online learning can be activated whenever necessary; for instance, the safety constraint is violated with the deployed policy. The closed-loop recursive feasibility, robustness, and asymptotic stability are proven under function approximation errors of the actor-critic networks. Simulation and experimental results on two nonlinear systems with unknown dynamics and disturbances have demonstrated that our approach has better or comparable performance when compared with tube-based MPC and LQR, and outperforms a recently developed actor-critic learning approach.
    Convergence of a robust deep FBSDE method for stochastic control. (arXiv:2201.06854v1 [math.OC])
    (0 min) In this paper we propose a deep learning based numerical scheme for strongly coupled FBSDE, stemming from stochastic control. It is a modification of the deep BSDE method in which the initial value to the backward equation is not a free parameter, and with a new loss function being the weighted sum of the cost of the control problem, and a variance term which coincides with the means square error in the terminal condition. We show by a numerical example that a direct extension of the classical deep BSDE method to FBSDE, fails for a simple linear-quadratic control problem, and motivate why the new method works. Under regularity and boundedness assumptions on the exact controls of time continuous and time discrete control problems we provide an error analysis for our method. We show empirically that the method converges for three different problems, one being the one that failed for a direct extension of the deep BSDE method.
    Targeted Optimal Treatment Regime Learning Using Summary Statistics. (arXiv:2201.06229v1 [stat.ME])
    (0 min) Personalized decision-making, aiming to derive optimal individualized treatment rules (ITRs) based on individual characteristics, has recently attracted increasing attention in many fields, such as medicine, social services, and economics. Current literature mainly focuses on estimating ITRs from a single source population. In real-world applications, the distribution of a target population can be different from that of the source population. Therefore, ITRs learned by existing methods may not generalize well to the target population. Due to privacy concerns and other practical issues, individual-level data from the target population is often not available, which makes ITR learning more challenging. We consider an ITR estimation problem where the source and target populations may be heterogeneous, individual data is available from the source population, and only the summary information of covariates, such as moments, is accessible from the target population. We develop a weighting framework that tailors an ITR for a given target population by leveraging the available summary statistics. Specifically, we propose a calibrated augmented inverse probability weighted estimator of the value function for the target population and estimate an optimal ITR by maximizing this estimator within a class of pre-specified ITRs. We show that the proposed calibrated estimator is consistent and asymptotically normal even with flexible semi/nonparametric models for nuisance function approximation, and the variance of the value estimator can be consistently estimated. We demonstrate the empirical performance of the proposed method using simulation studies and a real application to an eICU dataset as the source sample and a MIMIC-III dataset as the target sample.
    Online, Informative MCMC Thinning with Kernelized Stein Discrepancy. (arXiv:2201.07130v1 [cs.LG])
    (0 min) A fundamental challenge in Bayesian inference is efficient representation of a target distribution. Many non-parametric approaches do so by sampling a large number of points using variants of Markov Chain Monte Carlo (MCMC). We propose an MCMC variant that retains only those posterior samples which exceed a KSD threshold, which we call KSD Thinning. We establish the convergence and complexity tradeoffs for several settings of KSD Thinning as a function of the KSD threshold parameter, sample size, and other problem parameters. Finally, we provide experimental comparisons against other online nonparametric Bayesian methods that generate low-complexity posterior representations, and observe superior consistency/complexity tradeoffs. Code is available at github.com/colehawkins/KSD-Thinning.
    Particle Dual Averaging: Optimization of Mean Field Neural Networks with Global Convergence Rate Analysis. (arXiv:2012.15477v3 [stat.ML] UPDATED)
    (0 min) We propose the particle dual averaging (PDA) method, which generalizes the dual averaging method in convex optimization to the optimization over probability distributions with quantitative runtime guarantee. The algorithm consists of an inner loop and outer loop: the inner loop utilizes the Langevin algorithm to approximately solve for a stationary distribution, which is then optimized in the outer loop. The method can thus be interpreted as an extension of the Langevin algorithm to naturally handle nonlinear functional on the probability space. An important application of the proposed method is the optimization of neural network in the mean field regime, which is theoretically attractive due to the presence of nonlinear feature learning, but quantitative convergence rate can be challenging to obtain. By adapting finite-dimensional convex optimization theory into the space of measures, we analyze PDA in regularized empirical / expected risk minimization, and establish quantitative global convergence in learning two-layer mean field neural networks under more general settings. Our theoretical results are supported by numerical simulations on neural networks with reasonable size.
    Self-supervised Representation Learning of Neuronal Morphologies. (arXiv:2112.12482v2 [stat.ML] UPDATED)
    (0 min) Understanding the diversity of cell types and their function in the brain is one of the key challenges in neuroscience. The advent of large-scale datasets has given rise to the need of unbiased and quantitative approaches to cell type classification. We present GraphDINO, a purely data-driven approach to learning a low dimensional representation of the 3D morphology of neurons. GraphDINO is a novel graph representation learning method for spatial graphs utilizing self-supervised learning on transformer models. It combines attention-based global interaction between nodes and classic graph convolutional processing. We show, in two different species and cortical areas, that this method is able to yield morphological cell type clustering that is comparable to manual feature-based classification and shows a good correspondence to expert-labeled cell types. Our method is applicable beyond neuroscience in settings where samples in a dataset are graphs and graph-level embeddings are desired.
    Neuron-Specific Dropout: A Deterministic Regularization Technique to Prevent Neural Networks from Overfitting & Reduce Dependence on Large Training Samples. (arXiv:2201.06938v1 [cs.LG])
    (0 min) In order to develop complex relationships between their inputs and outputs, deep neural networks train and adjust large number of parameters. To make these networks work at high accuracy, vast amounts of data are needed. Sometimes, however, the quantity of data needed is not present or obtainable for training. Neuron-specific dropout (NSDropout) is a tool to address this problem. NSDropout looks at both the training pass, and validation pass, of a layer in a model. By comparing the average values produced by each neuron for each class in a data set, the network is able to drop targeted units. The layer is able to predict what features, or noise, the model is looking at during testing that isn't present when looking at samples from validation. Unlike dropout, the "thinned" networks cannot be "unthinned" for testing. Neuron-specific dropout has proved to achieve similar, if not better, testing accuracy with far less data than traditional methods including dropout and other regularization methods. Experimentation has shown that neuron-specific dropout reduces the chance of a network overfitting and reduces the need for large training samples on supervised learning tasks in image recognition, all while producing best-in-class results.
    Tractable structured natural gradient descent using local parameterizations. (arXiv:2102.07405v10 [stat.ML] UPDATED)
    (0 min) Natural-gradient descent (NGD) on structured parameter spaces (e.g., low-rank covariances) is computationally challenging due to difficult Fisher-matrix computations. We address this issue by using \emph{local-parameter coordinates} to obtain a flexible and efficient NGD method that works well for a wide-variety of structured parameterizations. We show four applications where our method (1) generalizes the exponential natural evolutionary strategy, (2) recovers existing Newton-like algorithms, (3) yields new structured second-order algorithms via matrix groups, and (4) gives new algorithms to learn covariances of Gaussian and Wishart-based distributions. We show results on a range of problems from deep learning, variational inference, and evolution strategies. Our work opens a new direction for scalable structured geometric methods.
    Equitable Community Resilience: The Case of Winter Storm Uri in Texas. (arXiv:2201.06652v1 [stat.ML])
    (0 min) Community resilience in the face of natural hazards relies on a community's potential to bounce back. A failure to integrate equity into resilience considerations results in unequal recovery and disproportionate impacts on vulnerable populations, which has long been a concern in the United States. This research investigated aspects of equity related to community resilience in the aftermath of Winter Storm Uri in Texas which led to extended power outages for more than 4 million households. County level outage and recovery data was analyzed to explore potential significant links between various county attributes and their share of the outages during the recovery and restoration phases. Next, satellite imagery was used to examine data at a much higher geographical resolution focusing on census tracts in the city of Houston. The goal was to use computer vision to extract the extent of outages within census tracts and investigate their linkages to census tracts attributes. Results from various statistical procedures revealed statistically significant negative associations between counties' percentage of non-Hispanic whites and median household income with the ratio of outages. Additionally, at census tract level, variables including percentages of linguistically isolated population and public transport users exhibited positive associations with the group of census tracts that were affected by the outage as detected by computer vision analysis. Informed by these results, engineering solutions such as the applicability of grid modernization technologies, together with distributed and renewable energy resources, when controlled for the region's topographical characteristics, are proposed to enhance equitable power grid resiliency in the face of natural hazards.
    Incompleteness of graph convolutional neural networks for points clouds in three dimensions. (arXiv:2201.07136v1 [stat.ML])
    (0 min) Graph convolutional neural networks (GCNN) are very popular methods in machine learning and have been applied very successfully to the prediction of the properties of molecules and materials. First-order GCNNs are well known to be incomplete, i.e., there exist graphs that are distinct but appear identical when seen through the lens of the GCNN. More complicated schemes have thus been designed to increase their resolving power. Applications to molecules (and more generally, point clouds), however, add a geometric dimension to the problem. The most straightforward and prevalent approach to construct graph representation for the molecules regards atoms as vertices in a graph and draws a bond between each pair of atoms within a certain preselected cutoff. Bonds can be decorated with the distance between atoms, and the resulting "distance graph convolution NNs" (dGCNN) have empirically demonstrated excellent resolving power and are widely used in chemical ML. Here we show that even for the restricted case of graphs induced by 3D atom clouds dGCNNs are not complete. We construct pairs of distinct point clouds that generate graphs that, for any cutoff radius, are equivalent based on a first-order Weisfeiler-Lehman test. This class of degenerate structures includes chemically-plausible configurations, setting an ultimate limit to the expressive power of some of the well-established GCNN architectures for atomistic machine learning. Models that explicitly use angular information in the description of atomic environments can resolve these degeneracies.
    Universal Online Learning: an Optimistically Universal Learning Rule. (arXiv:2201.05947v1 [cs.LG])
    (0 min) We study the subject of universal online learning with non-i.i.d. processes for bounded losses. The notion of an universally consistent learning was defined by Hanneke in an effort to study learning theory under minimal assumptions, where the objective is to obtain low long-run average loss for any target function. We are interested in characterizing processes for which learning is possible and whether there exist learning rules guaranteed to be universally consistent given the only assumption that such learning is possible. The case of unbounded losses is very restrictive, since the learnable processes almost surely visit a finite number of points and as a result, simple memorization is optimistically universal. We focus on the bounded setting and give a complete characterization of the processes admitting strong and weak universal learning. We further show that k-nearest neighbor algorithm (kNN) is not optimistically universal and present a novel variant of 1NN which is optimistically universal for general input and value spaces in both strong and weak setting. This closes all COLT 2021 open problems posed by Hanneke on universal online learning.
    Tk-merge: Computationally Efficient Robust Clustering Under General Assumptions. (arXiv:2201.06391v1 [stat.ME])
    (0 min) We address general-shaped clustering problems under very weak parametric assumptions with a two-step hybrid robust clustering algorithm based on trimmed k-means and hierarchical agglomeration. The algorithm has low computational complexity and effectively identifies the clusters also in presence of data contamination. We also present natural generalizations of the approach as well as an adaptive procedure to estimate the amount of contamination in a data-driven fashion. Our proposal outperforms state-of-the-art robust, model-based methods in our numerical simulations and real-world applications related to color quantization for image analysis, human mobility patterns based on GPS data, biomedical images of diabetic retinopathy, and functional data across weather stations.
    Bandit Data-Driven Optimization. (arXiv:2008.11707v2 [cs.LG] UPDATED)
    (0 min) Applications of machine learning in the non-profit and public sectors often feature an iterative workflow of data acquisition, prediction, and optimization of interventions. There are four major pain points that a machine learning pipeline must overcome in order to be actually useful in these settings: small data, data collected only under the default intervention, unmodeled objectives due to communication gap, and unforeseen consequences of the intervention. In this paper, we introduce bandit data-driven optimization, the first iterative prediction-prescription framework to address these pain points. Bandit data-driven optimization combines the advantages of online bandit learning and offline predictive analytics in an integrated framework. We propose PROOF, a novel algorithm for this framework and formally prove that it has no-regret. Using numerical simulations, we show that PROOF achieves superior performance than existing baseline. We also apply PROOF in a detailed case study of food rescue volunteer recommendation, and show that PROOF as a framework works well with the intricacies of ML models in real-world AI for non-profit and public sector applications.
    Stochastic Multi-Armed Bandits with Control Variates. (arXiv:2105.03962v3 [cs.LG] UPDATED)
    (0 min) This paper studies a new variant of the stochastic multi-armed bandits problem where auxiliary information about the arm rewards is available in the form of control variates. In many applications like queuing and wireless networks, the arm rewards are functions of some exogenous variables. The mean values of these variables are known a priori from historical data and can be used as control variates. Leveraging the theory of control variates, we obtain mean estimates with smaller variance and tighter confidence bounds. We develop an upper confidence bound based algorithm named UCB-CV and characterize the regret bounds in terms of the correlation between rewards and control variates when they follow a multivariate normal distribution. We also extend UCB-CV to other distributions using resampling methods like Jackknifing and Splitting. Experiments on synthetic problem instances validate performance guarantees of the proposed algorithms.
    Deep Learning strategies for ProtoDUNE raw data denoising. (arXiv:2103.01596v2 [hep-ph] UPDATED)
    (0 min) In this work, we investigate different machine learning-based strategies for denoising raw simulation data from the ProtoDUNE experiment. The ProtoDUNE detector is hosted by CERN and it aims to test and calibrate the technologies for DUNE, a forthcoming experiment in neutrino physics. The reconstruction workchain consists of converting digital detector signals into physical high-level quantities. We address the first step in reconstruction, namely raw data denoising, leveraging deep learning algorithms. We design two architectures based on graph neural networks, aiming to enhance the receptive field of basic convolutional neural networks. We benchmark this approach against traditional algorithms implemented by the DUNE collaboration. We test the capabilities of graph neural network hardware accelerator setups to speed up training and inference processes.
    Imputing Missing Observations with Time Sliced Synthetic Minority Oversampling Technique. (arXiv:2201.05634v1 [cs.LG])
    (0 min) We present a simple yet novel time series imputation technique with the goal of constructing an irregular time series that is uniform across every sample in a data set. Specifically, we fix a grid defined by the midpoints of non-overlapping bins (dubbed "slices") of observation times and ensure that each sample has values for all of the features at that given time. This allows one to both impute fully missing observations to allow uniform time series classification across the entire data and, in special cases, to impute individually missing features. To do so, we slightly generalize the well-known class imbalance algorithm SMOTE \cite{smote} to allow component wise nearest neighbor interpolation that preserves correlations when there are no missing features. We visualize the method in the simplified setting of 2-dimensional uncoupled harmonic oscillators. Next, we use tSMOTE to train an Encoder/Decoder long-short term memory (LSTM) model with Logistic Regression for predicting and classifying distinct trajectories of different 2D oscillators. After illustrating the the utility of tSMOTE in this context, we use the same architecture to train a clinical model for COVID-19 disease severity on an imputed data set. Our experiments show an improvement over standard mean and median imputation techniques by allowing a wider class of patient trajectories to be recognized by the model, as well as improvement over aggregated classification models.
    Perfect density models cannot guarantee anomaly detection. (arXiv:2012.03808v3 [cs.LG] UPDATED)
    (0 min) Thanks to the tractability of their likelihood, several deep generative models show promise for seemingly straightforward but important applications like anomaly detection, uncertainty estimation, and active learning. However, the likelihood values empirically attributed to anomalies conflict with the expectations these proposed applications suggest. In this paper, we take a closer look at the behavior of distribution densities through the lens of reparametrization and show that these quantities carry less meaningful information than previously thought, beyond estimation issues or the curse of dimensionality. We conclude that the use of these likelihoods for anomaly detection relies on strong and implicit hypotheses, and highlight the necessity of explicitly formulating these assumptions for reliable anomaly detection.
    Deep Indexed Active Learning for Matching Heterogeneous Entity Representations. (arXiv:2104.03986v2 [cs.DB] UPDATED)
    (0 min) Given two large lists of records, the task in entity resolution (ER) is to find the pairs from the Cartesian product of the lists that correspond to the same real world entity. Typically, passive learning methods on such tasks require large amounts of labeled data to yield useful models. Active Learning is a promising approach for ER in low resource settings. However, the search space, to find informative samples for the user to label, grows quadratically for instance-pair tasks making active learning hard to scale. Previous works, in this setting, rely on hand-crafted predicates, pre-trained language model embeddings, or rule learning to prune away unlikely pairs from the Cartesian product. This blocking step can miss out on important regions in the product space leading to low recall. We propose DIAL, a scalable active learning approach that jointly learns embeddings to maximize recall for blocking and accuracy for matching blocked pairs. DIAL uses an Index-By-Committee framework, where each committee member learns representations based on powerful pre-trained transformer language models. We highlight surprising differences between the matcher and the blocker in the creation of the training data and the objective used to train their parameters. Experiments on five benchmark datasets and a multilingual record matching dataset show the effectiveness of our approach in terms of precision, recall and running time. Code is available at https://github.com/ArjitJ/DIAL
    Iterative Causal Discovery in the Possible Presence of Latent Confounders and Selection Bias. (arXiv:2111.04095v2 [cs.LG] UPDATED)
    (0 min) We present a sound and complete algorithm, called iterative causal discovery (ICD), for recovering causal graphs in the presence of latent confounders and selection bias. ICD relies on the causal Markov and faithfulness assumptions and recovers the equivalence class of the underlying causal graph. It starts with a complete graph, and consists of a single iterative stage that gradually refines this graph by identifying conditional independence (CI) between connected nodes. Independence and causal relations entailed after any iteration are correct, rendering ICD anytime. Essentially, we tie the size of the CI conditioning set to its distance on the graph from the tested nodes, and increase this value in the successive iteration. Thus, each iteration refines a graph that was recovered by previous iterations having smaller conditioning sets -- a higher statistical power -- which contributes to stability. We demonstrate empirically that ICD requires significantly fewer CI tests and learns more accurate causal graphs compared to FCI, FCI+, and RFCI algorithms (code is available at https://github.com/IntelLabs/causality-lab).
    Embodied Self-supervised Learning by Coordinated Sampling and Training. (arXiv:2006.13350v2 [eess.AS] UPDATED)
    (0 min) Self-supervised learning can significantly improve the performance of downstream tasks, however, the dimensions of learned representations normally lack explicit physical meanings. In this work, we propose a novel self-supervised approach to solve inverse problems by employing the corresponding physical forward process so that the learned representations can have explicit physical meanings. The proposed approach works in an analysis-by-synthesis manner to learn an inference network by iteratively sampling and training. At the sampling step, given observed data, the inference network is used to approximate the intractable posterior, from which we sample input parameters and feed them to a physical process to generate data in the observational space; At the training step, the same network is optimized with the sampled paired data. We prove the feasibility of the proposed method by tackling the acoustic-to-articulatory inversion problem to infer articulatory information from speech. Given an articulatory synthesizer, an inference model can be trained completely from scratch with random initialization. Our experiments demonstrate that the proposed method can converge steadily and the network learns to control the articulatory synthesizer to speak like a human. We also demonstrate that trained models can generalize well to unseen speakers or even new languages, and performance can be further improved through self-adaptation.
    Robust uncertainty estimates with out-of-distribution pseudo-inputs training. (arXiv:2201.05890v1 [cs.LG])
    (0 min) Probabilistic models often use neural networks to control their predictive uncertainty. However, when making out-of-distribution (OOD)} predictions, the often-uncontrollable extrapolation properties of neural networks yield poor uncertainty predictions. Such models then don't know what they don't know, which directly limits their robustness w.r.t unexpected inputs. To counter this, we propose to explicitly train the uncertainty predictor where we are not given data to make it reliable. As one cannot train without data, we provide mechanisms for generating pseudo-inputs in informative low-density regions of the input space, and show how to leverage these in a practical Bayesian framework that casts a prior distribution over the model uncertainty. With a holistic evaluation, we demonstrate that this yields robust and interpretable predictions of uncertainty while retaining state-of-the-art performance on diverse tasks such as regression and generative modelling
    Minimax Optimality (Probably) Doesn't Imply Distribution Learning for GANs. (arXiv:2201.07206v1 [cs.LG])
    (0 min) Arguably the most fundamental question in the theory of generative adversarial networks (GANs) is to understand to what extent GANs can actually learn the underlying distribution. Theoretical and empirical evidence suggests local optimality of the empirical training objective is insufficient. Yet, it does not rule out the possibility that achieving a true population minimax optimal solution might imply distribution learning. In this paper, we show that standard cryptographic assumptions imply that this stronger condition is still insufficient. Namely, we show that if local pseudorandom generators (PRGs) exist, then for a large family of natural continuous target distributions, there are ReLU network generators of constant depth and polynomial size which take Gaussian random seeds so that (i) the output is far in Wasserstein distance from the target distribution, but (ii) no polynomially large Lipschitz discriminator ReLU network can detect this. This implies that even achieving a population minimax optimal solution to the Wasserstein GAN objective is likely insufficient for distribution learning in the usual statistical sense. Our techniques reveal a deep connection between GANs and PRGs, which we believe will lead to further insights into the computational landscape of GANs.
    Fractional SDE-Net: Generation of Time Series Data with Long-term Memory. (arXiv:2201.05974v1 [cs.LG])
    (0 min) In this paper, we focus on generation of time-series data using neural networks. It is often the case that input time-series data, especially taken from real financial markets, is irregularly sampled, and its noise structure is more complicated than i.i.d. type. To generate time series with such a property, we propose fSDE-Net: neural fractional Stochastic Differential Equation Network. It generalizes the neural SDE model by using fractional Brownian motion with Hurst index larger than half, which exhibits long-term memory property. We derive the solver of fSDE-Net and theoretically analyze the existence and uniqueness of the solution to fSDE-Net. Our experiments demonstrate that the fSDE-Net model can replicate distributional properties well.
    On Maximum-a-Posteriori estimation with Plug & Play priors and stochastic gradient descent. (arXiv:2201.06133v1 [stat.ML])
    (0 min) Bayesian methods to solve imaging inverse problems usually combine an explicit data likelihood function with a prior distribution that explicitly models expected properties of the solution. Many kinds of priors have been explored in the literature, from simple ones expressing local properties to more involved ones exploiting image redundancy at a non-local scale. In a departure from explicit modelling, several recent works have proposed and studied the use of implicit priors defined by an image denoising algorithm. This approach, commonly known as Plug & Play (PnP) regularisation, can deliver remarkably accurate results, particularly when combined with state-of-the-art denoisers based on convolutional neural networks. However, the theoretical analysis of PnP Bayesian models and algorithms is difficult and works on the topic often rely on unrealistic assumptions on the properties of the image denoiser. This papers studies maximum-a-posteriori (MAP) estimation for Bayesian models with PnP priors. We first consider questions related to existence, stability and well-posedness, and then present a convergence proof for MAP computation by PnP stochastic gradient descent (PnP-SGD) under realistic assumptions on the denoiser used. We report a range of imaging experiments demonstrating PnP-SGD as well as comparisons with other PnP schemes.
    Reliable Causal Discovery with Improved Exact Search and Weaker Assumptions. (arXiv:2201.05666v1 [cs.LG])
    (0 min) Many of the causal discovery methods rely on the faithfulness assumption to guarantee asymptotic correctness. However, the assumption can be approximately violated in many ways, leading to sub-optimal solutions. Although there is a line of research in Bayesian network structure learning that focuses on weakening the assumption, such as exact search methods with well-defined score functions, they do not scale well to large graphs. In this work, we introduce several strategies to improve the scalability of exact score-based methods in the linear Gaussian setting. In particular, we develop a super-structure estimation method based on the support of inverse covariance matrix which requires assumptions that are strictly weaker than faithfulness, and apply it to restrict the search space of exact search. We also propose a local search strategy that performs exact search on the local clusters formed by each variable and its neighbors within two hops in the super-structure. Numerical experiments validate the efficacy of the proposed procedure, and demonstrate that it scales up to hundreds of nodes with a high accuracy.
    ResNEsts and DenseNEsts: Block-based DNN Models with Improved Representation Guarantees. (arXiv:2111.05496v2 [cs.LG] UPDATED)
    (0 min) Models recently used in the literature proving residual networks (ResNets) are better than linear predictors are actually different from standard ResNets that have been widely used in computer vision. In addition to the assumptions such as scalar-valued output or single residual block, these models have no nonlinearities at the final residual representation that feeds into the final affine layer. To codify such a difference in nonlinearities and reveal a linear estimation property, we define ResNEsts, i.e., Residual Nonlinear Estimators, by simply dropping nonlinearities at the last residual representation from standard ResNets. We show that wide ResNEsts with bottleneck blocks can always guarantee a very desirable training property that standard ResNets aim to achieve, i.e., adding more blocks does not decrease performance given the same set of basis elements. To prove that, we first recognize ResNEsts are basis function models that are limited by a coupling problem in basis learning and linear prediction. Then, to decouple prediction weights from basis learning, we construct a special architecture termed augmented ResNEst (A-ResNEst) that always guarantees no worse performance with the addition of a block. As a result, such an A-ResNEst establishes empirical risk lower bounds for a ResNEst using corresponding bases. Our results demonstrate ResNEsts indeed have a problem of diminishing feature reuse; however, it can be avoided by sufficiently expanding or widening the input space, leading to the above-mentioned desirable property. Inspired by the DenseNets that have been shown to outperform ResNets, we also propose a corresponding new model called Densely connected Nonlinear Estimator (DenseNEst). We show that any DenseNEst can be represented as a wide ResNEst with bottleneck blocks. Unlike ResNEsts, DenseNEsts exhibit the desirable property without any special architectural re-design.
    On the Periodic Behavior of Neural Network Training with Batch Normalization and Weight Decay. (arXiv:2106.15739v3 [cs.LG] UPDATED)
    (0 min) Training neural networks with batch normalization and weight decay has become a common practice in recent years. In this work, we show that their combined use may result in a surprising periodic behavior of optimization dynamics: the training process regularly exhibits destabilizations that, however, do not lead to complete divergence but cause a new period of training. We rigorously investigate the mechanism underlying the discovered periodic behavior from both empirical and theoretical points of view and analyze the conditions in which it occurs in practice. We also demonstrate that periodic behavior can be regarded as a generalization of two previously opposing perspectives on training with batch normalization and weight decay, namely the equilibrium presumption and the instability presumption.
    Inferential Theory for Granular Instrumental Variables in High Dimensions. (arXiv:2201.06605v1 [econ.EM])
    (0 min) The Granular Instrumental Variables (GIV) methodology exploits panels with factor error structures to construct instruments to estimate structural time series models with endogeneity even after controlling for latent factors. We extend the GIV methodology in several dimensions. First, we extend the identification procedure to a large $N$ and large $T$ framework, which depends on the asymptotic Herfindahl index of the size distribution of $N$ cross-sectional units. Second, we treat both the factors and loadings as unknown and show that the sampling error in the estimated instrument and factors is negligible when considering the limiting distribution of the structural parameters. Third, we show that the sampling error in the high-dimensional precision matrix is negligible in our estimation algorithm. Fourth, we overidentify the structural parameters with additional constructed instruments, which leads to efficiency gains. Monte Carlo evidence is presented to support our asymptotic theory and application to the global crude oil market leads to new results.
    SGLB: Stochastic Gradient Langevin Boosting. (arXiv:2001.07248v5 [cs.LG] UPDATED)
    (0 min) This paper introduces Stochastic Gradient Langevin Boosting (SGLB) - a powerful and efficient machine learning framework that may deal with a wide range of loss functions and has provable generalization guarantees. The method is based on a special form of the Langevin diffusion equation specifically designed for gradient boosting. This allows us to theoretically guarantee the global convergence even for multimodal loss functions, while standard gradient boosting algorithms can guarantee only local optimum. We also empirically show that SGLB outperforms classic gradient boosting when applied to classification tasks with 0-1 loss function, which is known to be multimodal.
    Chaining Value Functions for Off-Policy Learning. (arXiv:2201.06468v1 [cs.LG])
    (0 min) To accumulate knowledge and improve its policy of behaviour, a reinforcement learning agent can learn `off-policy' about policies that differ from the policy used to generate its experience. This is important to learn counterfactuals, or because the experience was generated out of its own control. However, off-policy learning is non-trivial, and standard reinforcement-learning algorithms can be unstable and divergent. In this paper we discuss a novel family of off-policy prediction algorithms which are convergent by construction. The idea is to first learn on-policy about the data-generating behaviour, and then bootstrap an off-policy value estimate on this on-policy estimate, thereby constructing a value estimate that is partially off-policy. This process can be repeated to build a chain of value functions, each time bootstrapping a new estimate on the previous estimate in the chain. Each step in the chain is stable and hence the complete algorithm is guaranteed to be stable. Under mild conditions this comes arbitrarily close to the off-policy TD solution when we increase the length of the chain. Hence it can compute the solution even in cases where off-policy TD diverges. We prove that the proposed scheme is convergent and corresponds to an iterative decomposition of the inverse key matrix. Furthermore it can be interpreted as estimating a novel objective -- that we call a `k-step expedition' -- of following the target policy for finitely many steps before continuing indefinitely with the behaviour policy. Empirically we evaluate the idea on challenging MDPs such as Baird's counter example and observe favourable results.
    System-Agnostic Meta-Learning for MDP-based Dynamic Scheduling via Descriptive Policy. (arXiv:2201.07051v1 [cs.LG])
    (0 min) Dynamic scheduling is an important problem in applications from queuing to wireless networks. It addresses how to choose an item among multiple scheduling items in each timestep to achieve a long-term goal. Conventional approaches for dynamic scheduling find the optimal policy for a given specific system so that the policy from these approaches is usable only for the corresponding system characteristics. Hence, it is hard to use such approaches for a practical system in which system characteristics dynamically change. This paper proposes a novel policy structure for MDP-based dynamic scheduling, a descriptive policy, which has a system-agnostic capability to adapt to unseen system characteristics for an identical task (dynamic scheduling). To this end, the descriptive policy learns a system-agnostic scheduling principle--in a nutshell, "which condition of items should have a higher priority in scheduling". The scheduling principle can be applied to any system so that the descriptive policy learned in one system can be used for another system. Experiments with simple explanatory and realistic application scenarios demonstrate that it enables system-agnostic meta-learning with very little performance degradation compared with the system-specific conventional policies.
    Reconstruction of Incomplete Wildfire Data using Deep Generative Models. (arXiv:2201.06153v1 [stat.ML])
    (0 min) We present our submission to the Extreme Value Analysis 2021 Data Challenge in which teams were asked to accurately predict distributions of wildfire frequency and size within spatio-temporal regions of missing data. For the purpose of this competition we developed a variant of the powerful variational autoencoder models dubbed the Conditional Missing data Importance-Weighted Autoencoder (CMIWAE). Our deep latent variable generative model requires little to no feature engineering and does not necessarily rely on the specifics of scoring in the Data Challenge. It is fully trained on incomplete data, with the single objective to maximize log-likelihood of the observed wildfire information. We mitigate the effects of the relatively low number of training samples by stochastic sampling from a variational latent variable distribution, as well as by ensembling a set of CMIWAE models trained and validated on different splits of the provided data. The presented approach is not domain-specific and is amenable to application in other missing data recovery tasks with tabular or image-like information conditioned on auxiliary information.
    PyHHMM: A Python Library for Heterogeneous Hidden Markov Models. (arXiv:2201.06968v1 [cs.MS])
    (0 min) We introduce PyHHMM, an object-oriented open-source Python implementation of Heterogeneous-Hidden Markov Models (HHMMs). In addition to HMM's basic core functionalities, such as different initialization algorithms and classical observations models, i.e., continuous and multinoulli, PyHHMM distinctively emphasizes features not supported in similar available frameworks: a heterogeneous observation model, missing data inference, different model order selection criterias, and semi-supervised training. These characteristics result in a feature-rich implementation for researchers working with sequential data. PyHHMM relies on the numpy, scipy, scikit-learn, and seaborn Python packages, and is distributed under the Apache-2.0 License. PyHHMM's source code is publicly available on Github (https://github.com/fmorenopino/HeterogeneousHMM) to facilitate adoptions and future contributions. A detailed documentation (https://pyhhmm.readthedocs.io/en/latest), which covers examples of use and models' theoretical explanation, is available. The package can be installed through the Python Package Index (PyPI), via 'pip install pyhhmm'.
    Training Fair Deep Neural Networks by Balancing Influence. (arXiv:2201.05759v1 [cs.LG])
    (0 min) Most fair machine learning methods either highly rely on the sensitive information of the training samples or require a large modification on the target models, which hinders their practical application. To address this issue, we propose a two-stage training algorithm named FAIRIF. It minimizes the loss over the reweighted data set (second stage) where the sample weights are computed to balance the model performance across different demographic groups (first stage). FAIRIF can be applied on a wide range of models trained by stochastic gradient descent without changing the model, while only requiring group annotations on a small validation set to compute sample weights. Theoretically, we show that, in the classification setting, three notions of disparity among different groups can be mitigated by training with the weights. Experiments on synthetic data sets demonstrate that FAIRIF yields models with better fairness-utility trade-offs against various types of bias; and on real-world data sets, we show the effectiveness and scalability of FAIRIF. Moreover, as evidenced by the experiments with pretrained models, FAIRIF is able to alleviate the unfairness issue of pretrained models without hurting their performance.
    Who Increases Emergency Department Use? New Insights from the Oregon Health Insurance Experiment. (arXiv:2201.07072v1 [econ.EM])
    (0 min) We provide new insights into the finding that Medicaid increased emergency department (ED) use from the Oregon experiment. Using nonparametric causal machine learning methods, we find economically meaningful treatment effect heterogeneity in the impact of Medicaid coverage on ED use. The effect distribution is widely dispersed, with significant positive effects concentrated among high-use individuals. A small group - about 14% of participants - in the right tail with significant increases in ED use drives the overall effect. The remainder of the individualized treatment effects is either indistinguishable from zero or negative. The average treatment effect is not representative of the individualized treatment effect for most people. We identify four priority groups with large and statistically significant increases in ED use - men, prior SNAP participants, adults less than 50 years old, and those with pre-lottery ED use classified as primary care treatable. Our results point to an essential role of intensive margin effects - Medicaid increases utilization among those already accustomed to ED use and who use the emergency department for all types of care. We leverage the heterogeneous effects to estimate optimal assignment rules to prioritize insurance applications in similar expansions.
    Risk-Monotonicity in Statistical Learning. (arXiv:2011.14126v5 [cs.LG] UPDATED)
    (0 min) Acquisition of data is a difficult task in many applications of machine learning, and it is only natural that one hopes and expects the population risk to decrease (better performance) monotonically with increasing data points. It turns out, somewhat surprisingly, that this is not the case even for the most standard algorithms that minimize the empirical risk. Non-monotonic behavior of the risk and instability in training have manifested and appeared in the popular deep learning paradigm under the description of double descent. These problems highlight the current lack of understanding of learning algorithms and generalization. It is, therefore, crucial to pursue this concern and provide a characterization of such behavior. In this paper, we derive the first consistent and risk-monotonic (in high probability) algorithms for a general statistical learning setting under weak assumptions, consequently answering some questions posed by Viering et al. 2019 on how to avoid non-monotonic behavior of risk curves. We further show that risk monotonicity need not necessarily come at the price of worse excess risk rates. To achieve this, we derive new empirical Bernstein-like concentration inequalities of independent interest that hold for certain non-i.i.d.~processes such as Martingale Difference Sequences.
    Improving the quality control of seismic data through active learning. (arXiv:2201.06616v1 [stat.ML])
    (0 min) In image denoising problems, the increasing density of available images makes an exhaustive visual inspection impossible and therefore automated methods based on machine-learning must be deployed for this purpose. This is particulary the case in seismic signal processing. Engineers/geophysicists have to deal with millions of seismic time series. Finding the sub-surface properties useful for the oil industry may take up to a year and is very costly in terms of computing/human resources. In particular, the data must go through different steps of noise attenuation. Each denoise step is then ideally followed by a quality control (QC) stage performed by means of human expertise. To learn a quality control classifier in a supervised manner, labeled training data must be available, but collecting the labels from human experts is extremely time-consuming. We therefore propose a novel active learning methodology to sequentially select the most relevant data, which are then given back to a human expert for labeling. Beyond the application in geophysics, the technique we promote in this paper, based on estimates of the local error and its uncertainty, is generic. Its performance is supported by strong empirical evidence, as illustrated by the numerical experiments presented in this article, where it is compared to alternative active learning strategies both on synthetic and real seismic datasets.
    Posterior Adaptation With New Priors. (arXiv:2007.01386v3 [cs.LG] UPDATED)
    (2 min) Classification approaches based on the direct estimation and analysis of posterior probabilities will degrade if the original class priors begin to change. We prove that a unique (up to scale) solution is possible to recover the data likelihoods for a test example from its original class posteriors and dataset priors. Given the recovered likelihoods and a set of new priors, the posteriors can be re-computed using Bayes' Rule to reflect the influence of the new priors. The method is simple to compute and allows a dynamic update of the original posteriors.
    A New Look at Dynamic Regret for Non-Stationary Stochastic Bandits. (arXiv:2201.06532v1 [cs.LG])
    (2 min) We study the non-stationary stochastic multi-armed bandit problem, where the reward statistics of each arm may change several times during the course of learning. The performance of a learning algorithm is evaluated in terms of their dynamic regret, which is defined as the difference of the expected cumulative reward of an agent choosing the optimal arm in every round and the cumulative reward of the learning algorithm. One way to measure the hardness of such environments is to consider how many times the identity of the optimal arm can change. We propose a method that achieves, in $K$-armed bandit problems, a near-optimal $\widetilde O(\sqrt{K N(S+1)})$ dynamic regret, where $N$ is the number of rounds and $S$ is the number of times the identity of the optimal arm changes, without prior knowledge of $S$ and $N$. Previous works for this problem obtain regret bounds that scale with the number of changes (or the amount of change) in the reward functions, which can be much larger, or assume prior knowledge of $S$ to achieve similar bounds.
    Incorporating Multiple Cluster Centers for Multi-Label Learning. (arXiv:2004.08113v3 [cs.LG] UPDATED)
    (2 min) Multi-label learning deals with the problem that each instance is associated with multiple labels simultaneously. Most of the existing approaches aim to improve the performance of multi-label learning by exploiting label correlations. Although the data augmentation technique is widely used in many machine learning tasks, it is still unclear whether data augmentation is helpful to multi-label learning. In this article, we propose to leverage the data augmentation technique to improve the performance of multi-label learning. Specifically, we first propose a novel data augmentation approach that performs clustering on the real examples and treats the cluster centers as virtual examples, and these virtual examples naturally embody the local label correlations and label importances. Then, motivated by the cluster assumption that examples in the same cluster should have the same label, we propose a novel regularization term to bridge the gap between the real examples and virtual examples, which can promote the local smoothness of the learning function. Extensive experimental results on a number of real-world multi-label datasets clearly demonstrate that our proposed approach outperforms the state-of-the-art counterparts.
    Minimax risk classifiers with 0-1 loss. (arXiv:2201.06487v1 [stat.ML])
    (2 min) Supervised classification techniques use training samples to learn a classification rule with small expected 0-1-loss (error probability). Conventional methods enable tractable learning and provide out-of-sample generalization by using surrogate losses instead of the 0-1-loss and considering specific families of rules (hypothesis classes). This paper presents minimax risk classifiers (MRCs) that minimize the worst-case 0-1-loss over general classification rules and provide tight performance guarantees at learning. We show that MRCs are strongly universally consistent using feature mappings given by characteristic kernels. The paper also proposes efficient optimization techniques for MRC learning and shows that the methods presented can provide accurate classification together with tight performance guarantees
    On Well-posedness and Minimax Optimal Rates of Nonparametric Q-function Estimation in Off-policy Evaluation. (arXiv:2201.06169v1 [math.ST])
    (2 min) We study the off-policy evaluation (OPE) problem in an infinite-horizon Markov decision process with continuous states and actions. We recast the $Q$-function estimation into a special form of the nonparametric instrumental variables (NPIV) estimation problem. We first show that under one mild condition the NPIV formulation of $Q$-function estimation is well-posed in the sense of $L^2$-measure of ill-posedness with respect to the data generating distribution, bypassing a strong assumption on the discount factor $\gamma$ imposed in the recent literature for obtaining the $L^2$ convergence rates of various $Q$-function estimators. Thanks to this new well-posed property, we derive the first minimax lower bounds for the convergence rates of nonparametric estimation of $Q$-function and its derivatives in both sup-norm and $L^2$-norm, which are shown to be the same as those for the classical nonparametric regression (Stone, 1982). We then propose a sieve two-stage least squares estimator and establish its rate-optimality in both norms under some mild conditions. Our general results on the well-posedness and the minimax lower bounds are of independent interest to study not only other nonparametric estimators for $Q$-function but also efficient estimation on the value of any target policy in off-policy settings.
    Observing how deep neural networks understand physics through the energy spectrum of one-dimensional quantum mechanics. (arXiv:2201.06676v1 [physics.comp-ph])
    (2 min) We investigated how neural networks (NNs) understand physics using one-dimensional quantum mechanics. After training an NN to accurately predict energy eigenvalues from potentials, we used it to confirm the NN's understanding of physics from four different aspects. The trained NN could predict energy eigenvalues of a different potential than the one learned, focus on minima and maxima of a potential, predict the probability distribution of the existence of particles not used during training, and reproduce untrained physical phenomena. These results show that NNs can learn the laws of physics from only a limited set of data, predict the results of experiments under conditions different from those used for training, and predict physical quantities of types not provided during training. Since NNs understand physics through a different path than humans take, and by complementing the human way of understanding, they will be a powerful tool for advancing physics.
    Differentiable Rule Induction with Learned Relational Features. (arXiv:2201.06515v1 [stat.ML])
    (2 min) Rule-based decision models are attractive due to their interpretability. However, existing rule induction methods often results in long and consequently less interpretable set of rules. This problem can, in many cases, be attributed to the rule learner's lack of appropriately expressive vocabulary, i.e., relevant predicates. Most existing rule induction algorithms presume the availability of predicates used to represent the rules, naturally decoupling the predicate definition and the rule learning phases. In contrast, we propose the Relational Rule Network (RRN), a neural architecture that learns relational predicates that represent a linear relationship among attributes along with the rules that use them. This approach opens the door to increasing the expressiveness of induced decision models by coupling predicate learning directly with rule learning in an end to end differentiable fashion. On benchmark tasks, we show that these relational predicates are simple enough to retain interpretability, yet improve prediction accuracy and provide sets of rules that are more concise compared to state of the art rule induction algorithms.
    Online Time Series Anomaly Detection with State Space Gaussian Processes. (arXiv:2201.06763v1 [cs.LG])
    (2 min) We propose r-ssGPFA, an unsupervised online anomaly detection model for uni- and multivariate time series building on the efficient state space formulation of Gaussian processes. For high-dimensional time series, we propose an extension of Gaussian process factor analysis to identify the common latent processes of the time series, allowing us to detect anomalies efficiently in an interpretable manner. We gain explainability while speeding up computations by imposing an orthogonality constraint on the mapping from the latent to the observed. Our model's robustness is improved by using a simple heuristic to skip Kalman updates when encountering anomalous observations. We investigate the behaviour of our model on synthetic data and show on standard benchmark datasets that our method is competitive with state-of-the-art methods while being computationally cheaper.
    Towards Sample-efficient Overparameterized Meta-learning. (arXiv:2201.06142v1 [cs.LG])
    (2 min) An overarching goal in machine learning is to build a generalizable model with few samples. To this end, overparameterization has been the subject of immense interest to explain the generalization ability of deep nets even when the size of the dataset is smaller than that of the model. While the prior literature focuses on the classical supervised setting, this paper aims to demystify overparameterization for meta-learning. Here we have a sequence of linear-regression tasks and we ask: (1) Given earlier tasks, what is the optimal linear representation of features for a new downstream task? and (2) How many samples do we need to build this representation? This work shows that surprisingly, overparameterization arises as a natural answer to these fundamental meta-learning questions. Specifically, for (1), we first show that learning the optimal representation coincides with the problem of designing a task-aware regularization to promote inductive bias. We leverage this inductive bias to explain how the downstream task actually benefits from overparameterization, in contrast to prior works on few-shot learning. For (2), we develop a theory to explain how feature covariance can implicitly help reduce the sample complexity well below the degrees of freedom and lead to small estimation error. We then integrate these findings to obtain an overall performance guarantee for our meta-learning algorithm. Numerical experiments on real and synthetic data verify our insights on overparameterized meta-learning.
    Tolerance and Prediction Intervals for Non-normal Models. (arXiv:2011.11583v5 [stat.ME] UPDATED)
    (2 min) A prediction interval covers a future observation from a random process in repeated sampling, and is typically constructed by identifying a pivotal quantity that is also an ancillary statistic. Analogously, a tolerance interval covers a population percentile in repeated sampling and is often based on a pivotal quantity. One approach we consider in non-normal models leverages a link function resulting in a pivotal quantity that is approximately normally distributed. In settings where this normal approximation does not hold we consider a second approach for tolerance and prediction based on a confidence interval for the mean. These methods are intuitive, simple to implement, have proper operating characteristics, and are computationally efficient compared to Bayesian, re-sampling, and machine learning methods. This is demonstrated in the context of multi-site clinical trial recruitment with staggered site initiation, real-world time on treatment, and end-of-study success for a clinical endpoint.
    Hierarchical VAEs Know What They Don't Know. (arXiv:2102.08248v7 [cs.LG] UPDATED)
    (2 min) Deep generative models have been demonstrated as state-of-the-art density estimators. Yet, recent work has found that they often assign a higher likelihood to data from outside the training distribution. This seemingly paradoxical behavior has caused concerns over the quality of the attained density estimates. In the context of hierarchical variational autoencoders, we provide evidence to explain this behavior by out-of-distribution data having in-distribution low-level features. We argue that this is both expected and desirable behavior. With this insight in hand, we develop a fast, scalable and fully unsupervised likelihood-ratio score for OOD detection that requires data to be in-distribution across all feature-levels. We benchmark the method on a vast set of data and model combinations and achieve state-of-the-art results on out-of-distribution detection.
    Efficient Hyperparameter Tuning for Large Scale Kernel Ridge Regression. (arXiv:2201.06314v1 [cs.LG])
    (2 min) Kernel methods provide a principled approach to nonparametric learning. While their basic implementations scale poorly to large problems, recent advances showed that approximate solvers can efficiently handle massive datasets. A shortcoming of these solutions is that hyperparameter tuning is not taken care of, and left for the user to perform. Hyperparameters are crucial in practice and the lack of automated tuning greatly hinders efficiency and usability. In this paper, we work to fill in this gap focusing on kernel ridge regression based on the Nystr\"om approximation. After reviewing and contrasting a number of hyperparameter tuning strategies, we propose a complexity regularization criterion based on a data dependent penalty, and discuss its efficient optimization. Then, we proceed to a careful and extensive empirical evaluation highlighting strengths and weaknesses of the different tuning strategies. Our analysis shows the benefit of the proposed approach, that we hence incorporate in a library for large scale kernel methods to derive adaptively tuned solutions.
    Contrastive Regularization for Semi-Supervised Learning. (arXiv:2201.06247v1 [cs.LG])
    (2 min) Consistency regularization on label predictions becomes a fundamental technique in semi-supervised learning, but it still requires a large number of training iterations for high performance. In this study, we analyze that the consistency regularization restricts the propagation of labeling information due to the exclusion of samples with unconfident pseudo-labels in the model updates. Then, we propose contrastive regularization to improve both efficiency and accuracy of the consistency regularization by well-clustered features of unlabeled data. In specific, after strongly augmented samples are assigned to clusters by their pseudo-labels, our contrastive regularization updates the model so that the features with confident pseudo-labels aggregate the features in the same cluster, while pushing away features in different clusters. As a result, the information of confident pseudo-labels can be effectively propagated into more unlabeled samples during training by the well-clustered features. On benchmarks of semi-supervised learning tasks, our contrastive regularization improves the previous consistency-based methods and achieves state-of-the-art results, especially with fewer training iterations. Our method also shows robust performance on open-set semi-supervised learning where unlabeled data includes out-of-distribution samples.
    Faster Rates of Private Stochastic Convex Optimization. (arXiv:2108.00331v3 [cs.LG] UPDATED)
    (2 min) In this paper, we revisit the problem of Differentially Private Stochastic Convex Optimization (DP-SCO) and provide excess population risks for some special classes of functions that are faster than the previous results of general convex and strongly convex functions. In the first part of the paper, we study the case where the population risk function satisfies the Tysbakov Noise Condition (TNC) with some parameter $\theta>1$. Specifically, we first show that under some mild assumptions on the loss functions, there is an algorithm whose output could achieve an upper bound of $\tilde{O}((\frac{1}{\sqrt{n}}+\frac{\sqrt{d\log \frac{1}{\delta}}}{n\epsilon})^\frac{\theta}{\theta-1})$ for $(\epsilon, \delta)$-DP when $\theta\geq 2$, here $n$ is the sample size and $d$ is the dimension of the space. Then we address the inefficiency issue, improve the upper bounds by $\text{Poly}(\log n)$ factors and extend to the case where $\theta\geq \bar{\theta}>1$ for some known $\bar{\theta}$. Next we show that the excess population risk of population functions satisfying TNC with parameter $\theta\geq 2$ is always lower bounded by $\Omega((\frac{d}{n\epsilon})^\frac{\theta}{\theta-1}) $ and $\Omega((\frac{\sqrt{d\log \frac{1}{\delta}}}{n\epsilon})^\frac{\theta}{\theta-1})$ for $\epsilon$-DP and $(\epsilon, \delta)$-DP, respectively. In the second part, we focus on a special case where the population risk function is strongly convex. Unlike the previous studies, here we assume the loss function is {\em non-negative} and {\em the optimal value of population risk is sufficiently small}. With these additional assumptions, we propose a new method whose output could achieve an upper bound of $O(\frac{d\log\frac{1}{\delta}}{n^2\epsilon^2}+\frac{1}{n^{\tau}})$ for any $\tau\geq 1$ in $(\epsilon,\delta)$-DP model if the sample size $n$ is sufficiently large.
    (Almost) All of Entity Resolution. (arXiv:2008.04443v3 [stat.ME] UPDATED)
    (2 min) Whether the goal is to estimate the number of people that live in a congressional district, to estimate the number of individuals that have died in an armed conflict, or to disambiguate individual authors using bibliographic data, all these applications have a common theme - integrating information from multiple sources. Before such questions can be answered, databases must be cleaned and integrated in a systematic and accurate way, commonly known as record linkage, de-duplication, or entity resolution. In this article, we review motivational applications and seminal papers that have led to the growth of this area. Specifically, we review the foundational work that began in the 1940's and 50's that have led to modern probabilistic record linkage. We review clustering approaches to entity resolution, semi- and fully supervised methods, and canonicalization, which are being used throughout industry and academia in applications such as human rights, official statistics, medicine, citation networks, among others. Finally, we discuss current research topics of practical importance.
    Selecting time-series hyperparameters with the artificial jackknife. (arXiv:2002.04697v4 [stat.ME] UPDATED)
    (2 min) This article proposes a generalisation of the delete-$d$ jackknife to solve hyperparameter selection problems for time series. I call it artificial delete-$d$ jackknife to stress that this approach substitutes the classic removal step with a fictitious deletion, wherein observed datapoints are replaced with artificial missing values. This procedure keeps the data order intact and allows plain compatibility with time series. This manuscript shows a simple illustration in which it is applied to regulate high-dimensional elastic-net vector autoregressive moving average (VARMA) models.
    Robust Aggregation for Federated Learning. (arXiv:1912.13445v2 [stat.ML] UPDATED)
    (2 min) Federated learning is the centralized training of statistical models from decentralized data on mobile devices while preserving the privacy of each device. We present a robust aggregation approach to make federated learning robust to settings when a fraction of the devices may be sending corrupted updates to the server. The approach relies on a robust aggregation oracle based on the geometric median, which returns a robust aggregate using a constant number of iterations of a regular non-robust averaging oracle. The robust aggregation oracle is privacy-preserving, similar to the non-robust secure average oracle it builds upon. We establish its convergence for least squares estimation of additive models. We provide experimental results with linear models and deep networks for three tasks in computer vision and natural language processing. The robust aggregation approach is agnostic to the level of corruption; it outperforms the classical aggregation approach in terms of robustness when the level of corruption is high, while being competitive in the regime of low corruption. Two variants, a faster one with one-step robust aggregation and another one with on-device personalization, round off the paper.
    Bayesian Calibration of imperfect computer models using Physics-informed priors. (arXiv:2201.06463v1 [stat.ML])
    (2 min) In this work we introduce a computational efficient data-driven framework suitable for quantifying the uncertainty in physical parameters of computer models, represented by differential equations. We construct physics-informed priors for differential equations, which are multi-output Gaussian process (GP) priors that encode the model's structure in the covariance function. We extend this into a fully Bayesian framework which allows quantifying the uncertainty of physical parameters and model predictions. Since physical models are usually imperfect descriptions of the real process, we allow the model to deviate from the observed data by considering a discrepancy function. For inference Hamiltonian Monte Carlo (HMC) sampling is used. This work is motivated by the need for interpretable parameters for the hemodynamics of the heart for personal treatment of hypertension. The model used is the arterial Windkessel model, which represents the hemodynamics of the heart through differential equations with physically interpretable parameters of medical interest. As most physical models, the Windkessel model is an imperfect description of the real process. To demonstrate our approach we simulate noisy data from a more complex physical model with known mathematical connections to our modeling choice. We show that without accounting for discrepancy, the posterior of the physical parameters deviates from the true value while when accounting for discrepancy gives reasonable quantification of physical parameters uncertainty and reduces the uncertainty in subsequent model predictions.
    Risk bounds for PU learning under Selected At Random assumption. (arXiv:2201.06277v1 [math.ST])
    (2 min) Positive-unlabeled learning (PU learning) is known as a special case of semi-supervised binary classification where only a fraction of positive examples are labeled. The challenge is then to find the correct classifier despite this lack of information. Recently, new methodologies have been introduced to address the case where the probability of being labeled may depend on the covariates. In this paper, we are interested in establishing risk bounds for PU learning under this general assumption. In addition, we quantify the impact of label noise on PU learning compared to standard classification setting. Finally, we provide a lower bound on minimax risk proving that the upper bound is almost optimal.
    Low Regret Binary Sampling Method for Efficient Global Optimization of Univariate Functions. (arXiv:2201.07164v1 [cs.LG])
    (2 min) In this work, we propose a computationally efficient algorithm for the problem of global optimization in univariate loss functions. For the performance evaluation, we study the cumulative regret of the algorithm instead of the simple regret between our best query and the optimal value of the objective function. Although our approach has similar regret results with the traditional lower-bounding algorithms such as the Piyavskii-Shubert method for the Lipschitz continuous or Lipschitz smooth functions, it has a major computational cost advantage. In Piyavskii-Shubert method, for certain types of functions, the query points may be hard to determine (as they are solutions to additional optimization problems). However, this issue is circumvented in our binary sampling approach, where the sampling set is predetermined irrespective of the function characteristics. For a search space of $[0,1]$, our approach has at most $L\log (3T)$ and $2.25H$ regret for $L$-Lipschitz continuous and $H$-Lipschitz smooth functions respectively. We also analytically extend our results for a broader class of functions that covers more complex regularity conditions.
    Theoretical analysis and computation of the sample Frechet mean for sets of large graphs based on spectral information. (arXiv:2201.05923v1 [stat.ML])
    (2 min) To characterize the location (mean, median) of a set of graphs, one needs a notion of centrality that is adapted to metric spaces, since graph sets are not Euclidean spaces. A standard approach is to consider the Frechet mean. In this work, we equip a set of graphs with the pseudometric defined by the norm between the eigenvalues of their respective adjacency matrix. Unlike the edit distance, this pseudometric reveals structural changes at multiple scales, and is well adapted to studying various statistical problems for graph-valued data. We describe an algorithm to compute an approximation to the sample Frechet mean of a set of undirected unweighted graphs with a fixed size using this pseudometric.
    Transport away your problems: Calibrating stochastic simulations with optimal transport. (arXiv:2107.08648v2 [physics.data-an] UPDATED)
    (2 min) Stochastic simulators are an indispensable tool in many branches of science. Often based on first principles, they deliver a series of samples whose distribution implicitly defines a probability measure to describe the phenomena of interest. However, the fidelity of these simulators is not always sufficient for all scientific purposes, necessitating the construction of ad-hoc corrections to "calibrate" the simulation and ensure that its output is a faithful representation of reality. In this paper, we leverage methods from transportation theory to construct such corrections in a systematic way. We use a neural network to compute minimal modifications to the individual samples produced by the simulator such that the resulting distribution becomes properly calibrated. We illustrate the method and its benefits in the context of experimental particle physics, where the need for calibrated stochastic simulators is particularly pronounced.
    Tight Regret Bounds for Noisy Optimization of a Brownian Motion. (arXiv:2001.09327v2 [cs.LG] UPDATED)
    (2 min) We consider the problem of Bayesian optimization of a one-dimensional Brownian motion in which the $T$ adaptively chosen observations are corrupted by Gaussian noise. We show that as the smallest possible expected cumulative regret and the smallest possible expected simple regret scale as $\Omega(\sigma\sqrt{T / \log (T)}) \cap \mathcal{O}(\sigma\sqrt{T} \cdot \log T)$ and $\Omega(\sigma / \sqrt{T \log (T)}) \cap \mathcal{O}(\sigma\log T / \sqrt{T})$ respectively, where $\sigma^2$ is the noise variance. Thus, our upper and lower bounds are tight up to a factor of $\mathcal{O}( (\log T)^{1.5} )$. The upper bound uses an algorithm based on confidence bounds and the Markov property of Brownian motion (among other useful properties), and the lower bound is based on a reduction to binary hypothesis testing.
    Proxy-Normalizing Activations to Match Batch Normalization while Removing Batch Dependence. (arXiv:2106.03743v5 [cs.LG] UPDATED)
    (2 min) We investigate the reasons for the performance degradation incurred with batch-independent normalization. We find that the prototypical techniques of layer normalization and instance normalization both induce the appearance of failure modes in the neural network's pre-activations: (i) layer normalization induces a collapse towards channel-wise constant functions; (ii) instance normalization induces a lack of variability in instance statistics, symptomatic of an alteration of the expressivity. To alleviate failure mode (i) without aggravating failure mode (ii), we introduce the technique "Proxy Normalization" that normalizes post-activations using a proxy distribution. When combined with layer normalization or group normalization, this batch-independent normalization emulates batch normalization's behavior and consistently matches or exceeds its performance.
    Targeted VAE: Variational and Targeted Learning for Causal Inference. (arXiv:2009.13472v5 [stat.ML] UPDATED)
    (2 min) Undertaking causal inference with observational data is incredibly useful across a wide range of tasks including the development of medical treatments, advertisements and marketing, and policy making. There are two significant challenges associated with undertaking causal inference using observational data: treatment assignment heterogeneity (\textit{i.e.}, differences between the treated and untreated groups), and an absence of counterfactual data (\textit{i.e.}, not knowing what would have happened if an individual who did get treatment, were instead to have not been treated). We address these two challenges by combining structured inference and targeted learning. In terms of structure, we factorize the joint distribution into risk, confounding, instrumental, and miscellaneous factors, and in terms of targeted learning, we apply a regularizer derived from the influence curve in order to reduce residual bias. An ablation study is undertaken, and an evaluation on benchmark datasets demonstrates that TVAE has competitive and state of the art performance.
    Quantum Ensemble for Classification. (arXiv:2007.01028v3 [cs.LG] UPDATED)
    (2 min) A powerful way to improve performance in machine learning is to construct an ensemble that combines the predictions of multiple models. Ensemble methods are often much more accurate and lower variance than the individual classifiers that make them up but have high requirements in terms of memory and computational time. In fact, a large number of alternative algorithms is usually adopted, each requiring to query all available data. We propose a new quantum algorithm that exploits quantum superposition, entanglement and interference to build an ensemble of classification models. Thanks to the generation of the several quantum trajectories in superposition, we obtain $B$ transformations of the quantum state which encodes the training set in only $log\left(B\right)$ operations. This implies exponential growth of the ensemble size while increasing linearly the depth of the correspondent circuit. Furthermore, when considering the overall cost of the algorithm, we show that the training of a single weak classifier impacts additively the overall time complexity rather than multiplicatively, as it usually happens in classical ensemble methods. We also present small-scale experiments on real-world datasets, defining a quantum version of the cosine classifier and using the IBM qiskit environment to show how the algorithms work.
    Matrix Reordering for Noisy Disordered Matrices: Optimality and Computationally Efficient Algorithms. (arXiv:2201.06438v1 [math.ST])
    (2 min) Motivated by applications in single-cell biology and metagenomics, we consider matrix reordering based on the noisy disordered matrix model. We first establish the fundamental statistical limit for the matrix reordering problem in a decision-theoretic framework and show that a constrained least square estimator is rate-optimal. Given the computational hardness of the optimal procedure, we analyze a popular polynomial-time algorithm, spectral seriation, and show that it is suboptimal. We then propose a novel polynomial-time adaptive sorting algorithm with guaranteed improvement on the performance. The superiority of the adaptive sorting algorithm over the existing methods is demonstrated in simulation studies and in the analysis of two real single-cell RNA sequencing datasets.
    A three layer neural network can represent any multivariate function. (arXiv:2012.03016v2 [cs.LG] UPDATED)
    (2 min) In 1987, Hecht-Nielsen showed that any continuous multivariate function can be implemented by a certain type three-layer neural network. This result was very much discussed in neural network literature. In this paper we prove that not only continuous functions but also all discontinuous functions can be implemented by such neural networks.
    Mixed Logit Models and Network Formation. (arXiv:2006.16516v3 [cs.SI] UPDATED)
    (2 min) The study of network formation is pervasive in economics, sociology, and many other fields. In this paper, we model network formation as a `choice' that is made by nodes in a network to connect to other nodes. We study these `choices' using discrete-choice models, in which an agent chooses between two or more discrete alternatives. We employ the `repeated-choice' (RC) model to study network formation. We argue that the RC model overcomes important limitations of the multinomial logit (MNL) model, which gives one framework for studying network formation, and that it is well-suited to study network formation. We also illustrate how to use the RC model to accurately study network formation using both synthetic and real-world networks. Using synthetic networks, we also compare the performance of the MNL model and the RC model. We find that the RC model estimates the data-generation process of our synthetic networks more accurately than the MNL model. We do a case study of a qualitatively interesting scenario -- the fact that new patents are more likely to cite older, more cited, and similar patents -- for which the RC model allows us to achieve insights.
    Convolutional Signature for Sequential Data. (arXiv:2009.06719v2 [stat.ML] UPDATED)
    (2 min) Signature is an infinite graded sequence of statistics known to characterize geometric rough paths, which includes the paths with bounded variation. This object has been studied successfully for machine learning with mostly applications in low dimensional cases. In the high dimensional case, it suffers from exponential growth in the number of features in truncated signature transform. We propose a novel neural network based model which borrows the idea from Convolutional Neural Network to address this problem. Our model reduces the number of features efficiently in a data dependent way. Some empirical experiments are provided to support our model.
  • Data Science Central

    From Excel to Visualization via Ploomber
    (3 min) This short article is relevant to you if … You’d like to accelerate your development cycles. Your analytics isn’t in the place you want it to be. Your data is too large for Microsoft Excel, or it’s no longer the right tool for you. You seek more reproducibility & automation (downloading, wrangling, saving data) You… Read More »From Excel to Visualization via Ploomber The post From Excel to Visualization via Ploomber appeared first on Data Science Central.
  • The Official NVIDIA Blog

    NVIDIA GPUs Enable Simulation of a Living Cell
    (3 min) Every living cell contains its own bustling microcosm, with thousands of components responsible for energy production, protein building, gene transcription and more. Scientists at the University of Illinois at Urbana-Champaign have built a 3D simulation that replicates these physical and chemical characteristics at a particle scale — creating a fully dynamic model that mimics the Read article > The post NVIDIA GPUs Enable Simulation of a Living Cell appeared first on The Official NVIDIA Blog.
    GFN Thursday: ‘Tom Clancy’s Rainbow Six Extraction’ Charges Into GeForce NOW
    (2 min) Hello, Operator. This GFN Thursday brings the launch of Tom Clancy’s Rainbow Six Extraction to GeForce NOW. Plus, four new games are joining the GeForce NOW library to let you start your weekend off right. Your New Mission, Should You Choose to Accept It Grab your gadgets and get ready to game. Tom Clancy’s Rainbow Read article > The post GFN Thursday: ‘Tom Clancy’s Rainbow Six Extraction’ Charges Into GeForce NOW appeared first on The Official NVIDIA Blog.
    Van, Go: Pony.ai Unveils Next-Gen Robotaxi Fleet Built on NVIDIA DRIVE Orin
    (2 min) Robotaxis are on their way to delivering safer transportation, driving across various landscapes and through starry nights. This week, Silicon Valley-based self-driving startup Pony.ai announced its next-generation autonomous computing platform, built on NVIDIA DRIVE Orin for high-performance and scalable compute. The centralized system will serve as the brain for a robotaxi fleet of Toyota Sienna Read article > The post Van, Go: Pony.ai Unveils Next-Gen Robotaxi Fleet Built on NVIDIA DRIVE Orin appeared first on The Official NVIDIA Blog.
  • AWS Machine Learning Blog

    Detect mitotic figures in whole slide images with Amazon Rekognition
    (16 min) Even after more than a hundred years after its introduction, histology remains the gold standard in tumor diagnosis and prognosis. Anatomic pathologists evaluate histology to stratify cancer patients into different groups depending on their tumor genotypes and phenotypes, and their clinical outcome [1,2]. However, human evaluation of histological slides is subjective and not repeatable [3]. […]
    Distributed fine-tuning of a BERT Large model for a Question-Answering Task using Hugging Face Transformers on Amazon SageMaker
    (11 min) From training new models to deploying them in production, Amazon SageMaker offers the most complete set of tools for startups and enterprises to harness the power of machine learning (ML) and Deep Learning. With its Transformers open-source library and ML platform, Hugging Face makes transfer learning and the latest ML models accessible to the global […]
  • Becoming Human: Artificial Intelligence Magazine - Medium

    Amazon’s AI logistics warehouses
    (4 min) Warehousing and logistics are currently an industry with various complex operations that need flexibility and innovation. This industry is looking to digitally transform companies to improve…
    Getting Started with Hugging Face Transformers for NLP
    (7 min) Hugging Face 🤗: The Best Natural Language Processing Ecosystem You’re Not Using?
  • John D. Cook

    Reading plots of a complex function
    (2 min) This post is about understanding the following plot. If you’d like to read more along these lines, see [1]. The plot was made with the following Mathematica command. ComplexPlot[HankelH1[3, z], {z, -8 - 4 I, 4 + 4 I}, ColorFunction -> "CyclicArg", AspectRatio -> 1/2] The plot uses color to represent the phase of the […] Reading plots of a complex function first appeared on John D. Cook.
    Glasser’s function
    (2 min) I recently ran across an article on MathWorld about Glasser’s function. The function is defined by Here’s a plot of G(x) along with its asymptotic value 2√(x/π). This plot was made with the following Mathematica commands. G[x_] := NIntegrate[Sin[t Sin[t]], {t, 0, x}] Plot[{G[x], 2 Sqrt[x/Pi]}, {x, 0, 20}] According to MathWorld, The integral cannot […] Glasser’s function first appeared on John D. Cook.
  • Artificial Intelligence

    [News] Public Service Announcement: Subreddit for the latest game changing developments in AI/ML! 💙🧑‍💻
    (0 min) PSA: Join r/latestInML for the latest game changing developments in AI/ML. submitted by /u/fullerhouse570 [link] [comments]
    the beauty in art imagined by an A.I. using a CLIP model
    (0 min) submitted by /u/aroe51 [link] [comments]
    Dark Heart - generated by an AI (CLIP and VQGan)
    (0 min) ​ https://preview.redd.it/f17g9s98ewc81.png?width=2368&format=png&auto=webp&s=918c2c6d59e24658719d589423d660723f7033b5 submitted by /u/Hacknaut [link] [comments]
    Your All-in-one Off-grid Power Solution | Solar Generators
    (0 min) submitted by /u/assfatq [link] [comments]
    Researchers at Meta and the University of Texas at Austin Propose ‘Detic’: A Method to Detect Twenty-Thousand Classes using Image-Level Supervision
    (1 min) The difficulty of object detection is divided into two parts: detecting the object (localization) and labeling it (classification). Traditional techniques rely on box labels for all classes since these two sub-problems are intimately coupled. Despite extensive data-gathering efforts, detection datasets are substantially smaller in overall size and object classes (vocabulary) than picture classification datasets. The newest LVIS detection dataset, for example, has 1000+ classes and 120K photos; OpenImages offers 1.8M images for 500 classes. Furthermore, not all classes have enough annotations to train a reliable detector. Even the ten-year-old ImageNet dataset has 21K classes and 14M pictures in categorization. Continue Reading Paper: https://arxiv.org/pdf/2201.02605v2.pdf Github: https://github.com/facebookresearch/Detic https://preview.redd.it/tzcyi9qsdvc81.jpg?width=2678&format=pjpg&auto=webp&s=a2bac58fc9b0f2c7e4e2019bcacc7bf46d2c2d3b submitted by /u/techsucker [link] [comments]
    To read: The alignment problem OR Algorithms to live by - Brian Christian
    (1 min) Making the list of academic and scientific readings for my research thesis (an AI for media articles neutralization) and since time is precious, which one you'd recommend I read from this author? submitted by /u/Odd-Dot3210 [link] [comments]
    AI Predicts How CEO Personality Affects Company Performance
    (0 min) submitted by /u/DaveBowman1975 [link] [comments]
    I am looking for an AI to read and categorize scientific reports
    (1 min) So I have a library of about 17000 geological reports that I want to be ordered according to the areas where these reports were written about. Is there an AI that can perform a task like this? submitted by /u/swartsak [link] [comments]
    Project Recommendations
    (0 min) I am a beginner when it comes to AI. I don't really know where to start to advance my knowledge. I am interested in Physics, chemistry, and Biology. Anyone recommend a cool project that is doable by a beginner like me? submitted by /u/Famous-Bluebird685 [link] [comments]
    Artificial Nightmares : Kindred Wolf || Pytti VQGAN AI Art Video [4K 60 FPS]
    (0 min) submitted by /u/Thenamessd [link] [comments]
    ML/AI Powered Sensitive Data Inspection: First Principles Thinking (Part 1)
    (0 min) submitted by /u/AssociationBusy5717 [link] [comments]
  • Machine Learning

    [R] CM3: A Causal Masked Multimodal Model of the Internet
    (1 min) CM3 is new type of causal model trained on web documents (Wiki/CC-News) containing both text and tokenized images that is capable of zero-shot unconditional image generation (like PixelCNN), conditional image generation (DALL-E), image-infilling, image captioning, SOTA zero-shot summarization, SOTA entity-linking/disambiguation all while being competitive to T5 on fine-tuning. This makes CM3 one of the most, if not the most general single multi-modal zero-shot model trained to date. Arxiv: https://arxiv.org/abs/2201.07520 Twitter: https://twitter.com/ArmenAgha/status/1484117064503549955 submitted by /u/ArmenAg [link] [comments]
    [N] Public Service Announcement: Subreddit for the latest game changing developments in AI/ML! 💙🧑‍💻
    (0 min) PSA: Join r/latestInML for the latest game changing developments in AI/ML. submitted by /u/cv2020br [link] [comments]
    [D] Why Authors Dont Include The Source codes In Their Papers?
    (2 min) Hello, I am from soft robotics field and I am not really expert in machine learning and I am still trying to understand how machine learning can come up with new designs for composites, meta-materials, this paper doesnot include the source code, for me it is vague how to start since I never did this before, I dont know if anyone familiar with something and open source so that I can start learn and build on. Thanks submitted by /u/meldiwin [link] [comments]
    [D] ICLR 2022 RESULTS ARE OUT
    (2 min) Share your rants. submitted by /u/schrodingershit [link] [comments]
    [D] Parameters to use for a basic multiple linear regression
    (1 min) I have a very basic multiple linear regression working as expected in Python with Sklearn. If I have the input csv: Grind, RoastLevel, Time 5.5, 5, 20 5.25, 5, 22 And set the independent variables as Roast Level and Time, and dependent variable as Grind: data_minus_grind = [] grind_data = [] for row in floats: data_minus_grind.append(row[1:3]) grind_data.append(row[0]) When I run the regression to predict the Grind for a Time of 25, I expect a number slightly lower than 5.25, which is what I get: model = LinearRegression().fit(data_minus_grind, grind_data) prediction_data = [[5,25]] prediction = model.predict(prediction_data) The prediction comes out to be 4.875, which sounds about right. Here is the problem. I need this to work in the Dart language, so I'm using basically …
    [D] Which file type do you use the most when training your data
    (1 min) I was curious about the file types you use the most when you train your data for these tasks: - NLP (txt, csv, xlsx...?) - Computer vision | for image: (png, jpg, SVG...?) | for videos : (mp4, avi...?) - Audio processing (mp3, WAV, m4a...?) submitted by /u/_santoshp_ [link] [comments]
    Classification of Higgs boson decays using machine learning [R]
    (1 min) Topic discussed in my project: Study of the Higgs boson Yukawa coupling to tau leptons using the 2012 ATLAS Run-2 dataset. Particular focus is dedicated to the usage of machine learning classification algorithms to classify the Higgs decay channel H to tautau as signal with respect to the other background processes. For the classification have been considered the cases in which there are 0,1 or 2 jets in the final state. This classification has been performed on the free dataset from the Higgs Boson challenge that contains data related to the case in which we have in the final state a tau that decays hadronically and the other one that decays leptonically (data for same leptonic or hadronic decays of the tau are omitted). Where to find my project? GitHub: JustWhit3 / higgs-decay-classification submitted by /u/biancogianluca9 [link] [comments]
    Item2Vec - Word2Vec from gensim wrapped as sklearn estimator for GridSearchCV [P]
    (1 min) Prod2Vec or Item2Vec produces embedding for items in a latent space. The method is capable of inferring item-item relations even when user information is not available. It's based on NLP model Word2Vec. Click here to know more This project provide a class that encapsulates Item2Vec model (word2vec gensim model) as a sklearn estimator. It allows the simple and efficient use of the Item2Vec model by providing : metric to measure the performance of the model (Precision@K) compatibility with GridSearchCV and BayesSearchCV to find the optimal hyperparameters ​ Nice to have feedback ! https://github.com/MathieuCayssol/Item2Vec submitted by /u/Mathieu23AI [link] [comments]
    [D] Encoding method on Logistic Regression
    (1 min) Hi, I would like to know if in Logistic Regression is it possible to only use LabelEncoding and not OneHotEncoding. Can the ordinality applied by the Label Encoding affect the accuracy on Logistic Regression? Since when using the Label Encoder, some features can have higher values than others. Thanks submitted by /u/brych_ant [link] [comments]
    [Discussion] Can someone interpret on the tensorboard graphs of my stylegan2 training?
    (1 min) Hi! I'm training Stylegan2 on V3-8 Tpu and the tensorboard graphics are below but I couldn't interpret it. I have no idea how the training went. Can anyone comment? https://tensorboard.dev/experiment/AQi5Rz0rSDmHgapm9mYYFQ/#scalars&_smoothingWeight=0.855 You can take a look at it in more detail here. https://preview.redd.it/hapy8ppaivc81.png?width=2880&format=png&auto=webp&s=e04b34f3a1f98a39b1d7e1c20a8d3b971315f78f https://preview.redd.it/xkdpvzoaivc81.png?width=2880&format=png&auto=webp&s=446e8cbb2b608bceda25333d4ddb6f029bd4d366 https://preview.redd.it/mtobqzoaivc81.png?width=2880&format=png&auto=webp&s=9914979ba35d0c861c763faea8366238f8553e33 https://preview.redd.it/snf9l1paivc81.png?width=2880&format=png&auto=webp&s=2a793e3314380d4c94cbf64c82c3dd0db9701bbd https://preview.redd.it/nz9p22paivc81.png?width=2880&format=png&auto=webp&s=d3cfff237a34298d3d31d6e3407e5dcef270593b https://preview.redd.it/cjnxa1paivc81.png?width=2880&format=png&auto=webp&s=42cb7ee6826a2db9818779bea00afb788f0dd800 submitted by /u/Euphoric_Ad_1014 [link] [comments]
    [D] Image Augmentation for yolov3 model
    (2 min) I am doing a project to recognize the ASL (American sign language) alphabet with Yolov3, which means that I have 26 labels need to be detected. I read this question on StackOverflow How many images(minimum) should be there in each classes for training YOLO? It said that I should have about 2000 images to reach a decent performance of Yolov3. But I only have about 100 images for each class (after doing the flipping images). The problem with my model from that dataset is that it doesn't work well with dim light (the hand pose of most of the letter seems fine). should I double the dataset I have with the "darker image"-augmention feature? Do I need to find more data to reach the number of 2000? P/s: if anyone does the similar project with Yolov3, Please share your experience, submitted by /u/motnhathongthai [link] [comments]
    [D] Looking for a Machine Learning Course / YT Channel that goes through a complete machine learning project
    (1 min) Hello all, i hope you are well. While I was searching the courses related to machine learning, I have realized that most of the courses use the common datasets like Iris, Titanic vs. I am looking for a course/YT channel that explains concepts through machine learning projects especially regression ones. I need to understand the steps in a multivariate regression model and enrich my study. Can you recommend any youtube channel or course that goes through a real machine learning project? I would be glad if you suggest me something because I feel stucked right now :) submitted by /u/makcin [link] [comments]
    [D] Classifying small metallic objects which are known in advance
    (2 min) I've been trying to classify small metallic objects using CNNs for a few months now. Some of the objects differ in very small details so a lot can look very similarly. In order to train the neural network, I created datasets which are basically composed of images with different perspectives, angles, etc. of the objects. I tried all of the tricks in the book, including data augmentation, and have reached a good accuracy of around 95%. Neural networks don't seem to be, and I found confirmation in literature, invariant in regards to brightness, perspective and so on. However, I do not think this is a proper classification task since the objects are known in advance and the model should only match the past images it saw of the objects with a current one which could differ in perspective, distance, angle, brightness, etc. It's easy to reach good results with neural network since you can resort to overfitting. Does this kind of problem have a name? I thought it would be called object matching, object retrieval, image matching or image retrieval, but it seems I might be wrong. Basically, all of the objects are just laid on the table and the model should recognise all of them. I tried doing object segmentation using simple techniques like background subtraction since the table will always have the same color. It seems to be working quite well. I turned to image retrieval and tried using various feature detectors together with descriptors for the recognition part. I don't know what approach to decide when two images contain the same object. I searched in the literature but didn't find anything good enough. Usually, feature descriptors are used to train an SVM in the classical way either with a bag of words approach or something like VLAD, Fisher vectors are used as a global descriptor using cosine similarity or again training an SVM. Are there other possible approaches to this kind of problem? Any suggestions are appreciated. submitted by /u/metarenderer [link] [comments]
    [D] Are there unconventional cognition architectures that learn without SGD, weights between neurons, or can only be done on the CPU?
    (2 min) I suspect the ML rush in recent years with GPUs getting faster has created a blindspot where people are always thinking through the GPU lens, thinking that weights and synapses and big GPU processing is as good as it gets. What do you guys think? Could there be unconventional architectures that are better suited for the CPU with its flexibility of programming jumping around memory? Some other way to represent knowledge submitted by /u/ryunuck [link] [comments]
    [D] Accuracy improvement tweak for decision tree classifiers
    (0 min) A note describing a relatively simple tweak to squeeze more accuracy from decision tree classifiers: Backtrack Tie-Breaking for Decision Trees I did not find this mentioned in texts on decision tree classifiers. Is the subject overlooked or there are some inherent drawbacks associated to it? submitted by /u/crispub [link] [comments]
    [D] Autonomous weapons are here and the world is divided over their use
    (4 min) In 2020 a lethal autonomous weapon was used for the first time in an armed conflict - the Turkish-made drone - Kargu-2 - in Libya's civil war. In recent years, more weapon systems have incorporated elements of autonomy but they still rely on a person to launch an attack. But advances in AI, sensors, and electronics have made it easier to build more sophisticated autonomous systems, raising the prospect of machines that can decide on their own when to use lethal force. A growing list of countries, including Brazil, South Africa, New Zealand, and Switzerland, argue that lethal autonomous weapons should be restricted by treaty, as chemical and biological weapons have been. China supports an extremely narrow set of restrictions. Other nations, including the US, Russia, India, the UK, and Australia, object to a ban on lethal autonomous weapons arguing that they need to develop the technology to avoid being placed at a strategic disadvantage. This is no longer stuff of the future though. Source: December issue of mindkosh.com/mindkosh-ai-review-newsletter.html submitted by /u/ifcarscouldspeak [link] [comments]
    [D] Experience with training
    (1 min) How do you know if a model has learned the problem? I use Ray these days to train and watch the Tensorboard metrics to look for rewards move up. The problem that I have is that the rewards can be maximised indefinitely, so I cannot have an early stop. I also cannot accept “no negative rewards” as good enough, cause the model could simply stop doing anything and that’d be enough to stop the training. (The negative rewards are also expected) Is there any metric (ex. policy loss? combined with entropy loss?) to watch for in order to interpret the quality of the model’s “understanding” of the problem? I use PPO, and have tried A2C/A3C. The model works against a generated polynomial data, but the real data has more noise, which is another issue to tackle down the road.. Thank you all. submitted by /u/lawabidingcitizen-nz [link] [comments]
    [D] Can a Siamese Neural Network work for invoice classification?
    (2 min) I am working on a task that consists of different shipping invoice. I have to build a model to classify them. I have one example for each invoice so I am thinking will a Siamese Neural Network do the job I am open to other methods too but this is the only thing that struck my mind. If you'll have any other idea please let me know submitted by /u/__Spee__ [link] [comments]
    [D] How to backprop through an arbitrary unknown function?
    (1 min) I am familiar with all the chain rule, vector Jacobian product & Jacobian product vector stuff in textbooks where you can usually derive the solution with calculus. But what are sound ways to backprop through an arbitrary function? Say f(x) is the order statistics of x (array) or something that you could use the function to get values but no idea of how it works internally. Could someone enlighten me? submitted by /u/shenglih [link] [comments]
    [D] Intuition behind GAN augmentation
    (7 min) I've seen several papers that train a conditional GAN on a labelled training set and then use that cGAN to generate more training data for a supervised model, leading to better performance than the baseline approach of just training the supervised model on the original data. Sometimes there's an extra step to throw away any garbage outputs from the cGAN, but this usually doesn't require any extra supervision. I used to see this as a "free lunch", but now I think I understand why it works, and I'd like to double-check my understanding. The key intuition (I think) is that generative models tend to generalize better than discriminative models when training data is limited. A cGAN is literally a generative model that fits to the distribution Pr(X | y), and training a supervised model on its output is just a fancy way of computing the posterior Pr(y | X). So a supervised model trained on a cGAN output is equivalent to a generative model, so it generalizes better from limited data. Is my understanding correct? Assuming it is, I could still use some more intuition on why a cGAN in particular generalizes better. My understanding is that generative models make more assumptions about the world, e.g. Naive Bayes assumes independent features. So how does that apply to a cGAN? Is it just the fact that it has to generate images, whereas supervised models only have to predict from them? It's kind of hard to see an intuition when cGANs and supervised models both look like big piles of linear algebra. submitted by /u/MeyerLouis [link] [comments]
  • Neural Networks, Deep Learning and Machine Learning

    Let's Enhance - 2022 enhanced edition
    (0 min) submitted by /u/3dforlife [link] [comments]
    Hi, I'm considering starting to mess with neural networks, where do I start? My final goal for now is a wake word detector
    (2 min) (I did try using pytorch once, but found it too difficult) submitted by /u/Hananelroe [link] [comments]
  • Reinforcement Learning

    Wordle Gym Environment
    (1 min) I created a Gym environment for playing Wordle (gym-wordle). Would love feedback and/or suggestions for how to make it better. This was mainly an exercise for me in creating a custom Gym environment (my first one). I'm now kicking around ideas for what kind of RL models would be suited for this game. Repo: https://github.com/zach-lawless/gym-wordle submitted by /u/znipdawg [link] [comments]
    Are 100 to 1000 actions large number of action space for DQN?
    (1 min) I am training a DQN for devices control. I have N devices (less than 10)and each devices has M gears(M=8). So the number of action space is M^N. Now i am going to training a 512 actions DQN, is this action number too large? Is there any way to deal with too many actions?The effections of different actions are similar, such as in 3 devices , (1,6,7)≈(1,7,6). submitted by /u/bakurawang [link] [comments]
    Python libraries for solving reinforcement learning problems implemented in OpenAI gym
    (2 min) I have implemented a reinforcement problem with Open AI gym in Python and I would like to solve it using different algorithms like Deep-Q-Learning and SARSA (any maybe others). My question is which libaries are available for solving the problem in python? I was using the Keras-rl library but it seems that it is no longer supported as it generates an error message when using the code posted below. Which libraries are suitalbe for solving problems implemented in OpenAI gym? #%% import from gym import Env from gym.spaces import Discrete, Box, Tuple import numpy as np #%% class Custom_Env(Env): def __init__(self): # Define the state space #State variables self.state_1 = 0 self.state_2 = 0 self.state_3 = 0 self.state_4_currentTimeSlots = 0 #Define the gym components self.action_space = Box(lo…
    RL As A Service Companies?
    (1 min) Are there any that you'd recommend? submitted by /u/santeryl [link] [comments]
    Modifying Mujoco environment
    (1 min) I have a project where i have to modify a mujoco environment - Doubleinvertedpendulum into a closed loop kinematic chain, thus i have change the xml file to include a second cart with the end poles joined with the poles of the initial cart, as well as fixing the positions of the carts and only allowing rotational movement. However i am still using the environment file from the original mujoco environment with a single cart. My question is: if i were to change the xml file, does mujoco somehow modifies the environment dynamics (etc how reward is given) according to how the xml file is modified? Or do i have to modifiy the environment file myself? submitted by /u/Lifeisunfair210 [link] [comments]
    Need help!!!
    (1 min) How to determine which q table to use after finishing training? (Double Q-Learning) submitted by /u/Professional_Card176 [link] [comments]
    Conservative Q Learning TD error not converging
    (1 min) Hi, I am using the discrete conservative Q learning implementation in the d3rlpy library (https://github.com/takuseno/d3rlpy) to train a policy offline to optimize mechanical ventilation treatment by using the MIMIC-III dataset (https://physionet.org/content/mimiciii-demo/1.4/). The state space for my problem is a set of 38 measurements taken from the MIMIC-III dataset such as heartrate, blood pressure, etc. The action space is a combination of 3 settings (Positive end-expiratory pressure, fraction of inspired oxygen and adjusted tidal volume) on the mechanical ventilator which have been binned to make the action space discrete. The baseline I am comparing it to is Double DQN (Double Deep Q Network) since the library builds its conservative Q learning implementation on top of Double DQN. My baseline Double DQN seems to converge just fine but whatever hyperparameters I try for Discrete Conservative Q Learning (Discrete CQL), the TD error always diverges. Am I doing something wrong? Is it possible that the policy is converging while the TD error is diverging? This is my first RL project so I am very new to all this and any help at all is really appreciated. Here is a screenshot of the graphs for some of my hyperparameters (gamma is the discount factor and update_interval is the update interval for the target network): https://imgur.com/ErJfUnJ. submitted by /u/superkaiba [link] [comments]

2022-01-19

  • Data Science Central

    A Framework for Understanding Transformation–Part II
    (5 min) In Part I of this series, we discussed the VUCA environment in which business are operating and the impact this has had and will continue to have on how businesses are run, discussed the need for business agility and introduced the EA model that Agile ERM employs.  In this article, we will look at three… Read More »A Framework for Understanding Transformation–Part II The post A Framework for Understanding Transformation–Part II appeared first on Data Science Central.
    Quasi A/B test: An overlooked opportunity in company-sponsored research
    (5 min) Andrea Polonioli & Ciro Greco Leading companies have been seizing the unprecedented opportunities offered by the web to test hypotheses quickly by using controlled experiments, typically called A/B tests. A/B tests are now ubiquitous: when you open up your Netflix app, there is a good chance you are part of one or more experiments while… Read More »Quasi A/B test: An overlooked opportunity in company-sponsored research The post Quasi A/B test: An overlooked opportunity in company-sponsored research appeared first on Data Science Central.
    DSC Weekly Digest for January 18, 2022: Web 3 Isn’t About Blockchain … It’s About Gaming
    (7 min) Microsoft announced today the purchase of Activision / Blizzard, creators of such game titles as World of Warcraft, Diablo, Spyro, Call of Duty, and many other titles, expanding its reach in the gaming sector significantly while also putting yet another major stake in the Metaverse ground. The deal, at $69 billion dollars, is one of… Read More »DSC Weekly Digest for January 18, 2022: Web 3 Isn’t About Blockchain … It’s About Gaming The post DSC Weekly Digest for January 18, 2022: Web 3 Isn’t About Blockchain … It’s About Gaming appeared first on Data Science Central.
  • The Official NVIDIA Blog

    New NVIDIA AI Enterprise Release Lights Up Data Centers
    (5 min) With a new year underway, NVIDIA is helping enterprises worldwide add modern workloads to their mainstream servers using the latest release of the NVIDIA AI Enterprise software suite. NVIDIA AI Enterprise 1.1 is now generally available. Optimized, certified and supported by NVIDIA, the latest version of the software suite brings new updates including production support Read article > The post New NVIDIA AI Enterprise Release Lights Up Data Centers appeared first on The Official NVIDIA Blog.
    Fusing Art and Tech: MORF Gallery CEO Scott Birnbaum on Digital Paintings, NFTs and More
    (3 min) Browse through MORF Gallery — virtually or at an in-person exhibition — and you’ll find robots that paint, digital dreamscape experiences, and fine art brought to life by visual effects. The gallery showcases cutting-edge, one-of-a-kind artwork from award-winning artists who fuse their creative skills with AI, machine learning, robotics and neuroscience. Scott Birnbaum, CEO and Read article > The post Fusing Art and Tech: MORF Gallery CEO Scott Birnbaum on Digital Paintings, NFTs and More appeared first on The Official NVIDIA Blog.
  • MIT News - Artificial intelligence

    When should someone trust an AI assistant’s predictions?
    (7 min) Researchers have created a method to help workers collaborate with artificial intelligence systems.
  • Becoming Human: Artificial Intelligence Magazine - Medium

    Voice-Based Gender Identification using SVM
    (10 min) Let’s find out whether this machine learning model is able to distinguish male and female voices.
    Is AI taking quality and cost optimization of enterprise services to the next level?
    (6 min) Dr Adrian Engelbrecht, Product & Development Lead, Serviceware AI, looks at how AI is taking quality and cost optimization of enterprise services to the next level. Business and service leaders are…
    Mining Associations & Frequent Patterns
    (7 min) Imagine that you are a sales manager at Big Electronics Shop, and you are talking to a customer who recently bought a PC and a digital…
  • John D. Cook

    Upper bound on wait time for general queues
    (1 min) The previous post presented an approximation for the steady-state wait time in queue with a general probability distribution for inter-arrival times and for service times, i.e. a G/G/s queue where s is the number of servers. This post will give an upper bound for the wait time. Let σ²A be the variance on the inter-arrival […] Upper bound on wait time for general queues first appeared on John D. Cook.
  • Microsoft Research

    DeepSpeed: Advancing MoE inference and training to power next-generation AI scale
    (21 min) In the last three years, the largest trained dense models have increased in size by over 1,000 times, from a few hundred million parameters to over 500 billion parameters in Megatron-Turing NLG 530B (MT-NLG). Improvements in model quality with size suggest that this trend will continue, with larger model sizes bringing better model quality. However, […] The post DeepSpeed: Advancing MoE inference and training to power next-generation AI scale appeared first on Microsoft Research.
  • Artificial Intelligence

    Is there a AI, which is able to generate realistic images, no matter from what?
    (0 min) submitted by /u/xXNOdrugsForMEXx [link] [comments]
    Innovating Meaningful & Impactful Health System Transformation - Fanny Sie - One Roche Head of Artificial Intelligence and Digital Health, F. Hoffmann La Roche
    (1 min) submitted by /u/ObjectiveGround5 [link] [comments]
    What an Ai thinks the Broken Peace is.
    (1 min) submitted by /u/Crow19852 [link] [comments]
    Researchers Introduce ‘SeMask’: An Effective Transformer-Framework That Incorporates Semantic Information Into The Encoder With The Help Of A Semantic Attention Operation
    (1 min) After demonstrating the transformer’s efficiency in the visual domain, the research community has focused on extending its use to several fields. One of these is semantic segmentation, a critical application for many areas, such as autonomous drive or medical diagnosis. The classical approach to this topic has been to use an existing pre-trained Transformer Layer as an encoder, tuning it for the segmentation task. However, this approach lacks insight into the semantic context during fine-tuning due to the relatively small dataset size compared to the one used for pre-training. The Picsart AI Research team, together with the SHI Lab and IIT Roorkee, proposed the SeMask framework to incorporate semantic information into a generic hierarchical vision transformer architecture (such as Swin Transformer) through two techniques. First, a Semantic Layer is added after the Transformer Layer; second, two decoders are used: a lightweight semantic decoder used only for training and a feature decoder. The comparison of the existing methods general pipeline and SeMask is shown in the image below. Paper Summary: https://www.marktechpost.com/2022/01/19/researchers-introduce-semask-an-effective-transformer-framework-that-incorporates-semantic-information-into-the-encoder-with-the-help-of-a-semantic-attention-operation/ Paper: https://arxiv.org/abs/2112.12782 Github: https://github.com/Picsart-AI-Research/SeMask-Segmentation ​ https://preview.redd.it/cw5v6fsfzoc81.png?width=1206&format=png&auto=webp&s=0c171b2edc18d98a684f3bfc0882bf4f80383da8 submitted by /u/techsucker [link] [comments]
    How AI Can Solve Customer Behavior Analysis
    (0 min) submitted by /u/Visionifyai [link] [comments]
    [R] Less is More: Understanding Neural Network Decisions via Simplified Yet Informative Inputs
    (1 min) A research team from University Medical Center Freiburg, ML Collective, and Google Brain introduces SimpleBits — an information-reduction method that learns to synthesize simplified inputs that contain less information yet remain informative for the task, providing a new approach for exploring the basis of network decisions. Here is a quick read: Less is More: Understanding Neural Network Decisions via Simplified Yet Informative Inputs. The paper When Less is More: Simplifying Inputs Aids Neural Network Understanding is on arXiv. submitted by /u/Yuqing7 [link] [comments]
    Papers With Code's Coolest AI Publication of 2021 Explained: ADOP - Create Smooth Videos from Images!
    (1 min) submitted by /u/OnlyProggingForFun [link] [comments]
    Top 10 Machine Translation APIs
    (0 min) submitted by /u/tah_zem [link] [comments]
    What AIs are there which are able to generate images which look realistic?
    (1 min) There are so many cool AIs which can make art, or edit images now I am asking myself if it is possible for AI to generate realistic images no matter from what. 2 Question: Is there a list with all AIs out there and what they can do? submitted by /u/xXNOdrugsForMEXx [link] [comments]
    What AI could I use to turn images from faces into drawings/sketches?
    (1 min) The AI could coast money. submitted by /u/xXLisa28Xx [link] [comments]
    https://www.youtube.com/watch?v=mSRrtr497do
    (0 min) submitted by /u/Thenamessd [link] [comments]
    CALTECH CMTE AI ML BOOTCAMP
    (0 min) I’m looking into enrolling in this bootcamp: https://pg-p.ctme.caltech.edu/online-ai-machine-learning-bootcamp-live-training. Has anybody taken this bootcamp? Any thoughts? The modules are thought by CALTECH and IBM. submitted by /u/adraxediem [link] [comments]
    Alan Watts predicted Neuralink over 60 years ago. (Awesome video)
    (0 min) submitted by /u/Wisdom369 [link] [comments]
    Artificial Nightmares : Shrek Is Human || Pytti VQGAN AI Art Video [4K 60 FPS]
    (0 min) submitted by /u/Thenamessd [link] [comments]
    Artificial General Intelligence Conference in 2022
    (2 min) submitted by /u/akolonin [link] [comments]
    Generating Natural Language Training Data from Physics Simulations in Unity
    (0 min) submitted by /u/itsallshit-eatup [link] [comments]
    Radiohead - Pyramid Song - AI Video Art
    (0 min) submitted by /u/glenniszen [link] [comments]
  • Machine Learning

    [D] Big differences between swarm learning and federated learning
    (1 min) A fairly new paper in Nature (https://www.nature.com/articles/s41586-021-03583-3) in 2021 introduces a new framework that can deal with decentralized data called swarm learning. They said that the biggest innovation of swarm learning is that there is no central node that aggregates the data. However, according to their supplementary, at the time when they "share the learning from the individual models, one of the swarm nodes is dynamically elected as a leader...the elected leader node collects the model parameters from each peer node and merges them... the nodes use the merged parameters to start the next training batch". In my opinion, although swarm learning does not have an explicit central node, each time when the model needs to share the parameters one of the local nodes is chosen as a center, which can be seen as an implicit central node. Besides, I don't know how to keep the privacy in this setting because one local node will receive all the parameters of the other nodes, not the average of the parameters as the standard federated learning setting. Also, the authors use the blockchain as the network and say that blockchain smart contracts prevent attacks from semi-honest or dishonest participants. I am not very familiar with the blockchain so I am a little bit confused about how it works in detail and what the advantage of the blockchain is. submitted by /u/mike_shen [link] [comments]
    [D] How to train a NeRF in seconds explained - Instant Neural Graphics Primitives with a Multiresolution Hash Encoding - 5-minute paper summary (by Casual GAN Papers)
    (1 min) If you liked the 100x NeRF speed up from a month ago, you definitely will love this fresh new way to train NeRF 1000x faster proposed in a paper by Thomas Müller and the team at Nvidia that utilizes a custom data structure for input encoding that is implemented as CUDA kernels highly optimized for the modern GPUs. Specifically, the authors propose to learn a multiresolution hashtable that maps the query coordinates to feature vectors. The encoded input feature vectors are passed through a small MLP to predict the color and density of a point in the scene, NeRF-style. How does this help the model to fit entire scenes in seconds? Let’s learn! Full summary: https://t.me/casual_gan/239 Blog post: https://www.casualganpapers.com/fastest_nerf_3d_neural_rendering/Instant-Neural-Graphics-Primitives-explained.html Instant NeRF arxiv / code Subscribe to Casual GAN Papers and follow me on Twitter for weekly AI paper summaries! submitted by /u/KirillTheMunchKing [link] [comments]
    [R] AutoInit Software for Model Initialization
    (1 min) AutoInit is a weight initialization method that automatically adapts to different neural network architectures. It tracks the mean and variance of signals as they propagate through the network and initializes the weights at each layer to avoid exploding or vanishing signals. AutoInit can be used to improve performance of feedforward, convolutional, and residual networks; configured with different activation function, dropout, weight decay, learning rate, and normalizer settings; and applied to vision, language, tabular, multi-task, and transfer learning domains. The software package provides a simple wrapper that makes it possible to apply AutoInit to existing TensorFlow models as-is. We invite you to try it out and see if it can improve the performance of your neural network models! For further details, see - GitHub repo: https://github.com/cognizant-ai-labs/autoinit - arXiv paper: https://arxiv.org/abs/2109.08958 AutoInit is also available through the Cognizant AI Labs Software page, together with related software on estimating model uncertainty, multitasking, loss-function metalearning, decision making, and model management, at - https://evolution.ml/software -- Garrett Bingham & Risto Miikkulainen submitted by /u/thedeepevolution [link] [comments]
    [D] ICML abstract deadline vs ICLR results date
    (1 min) This year, the ICML abstract deadline (Jan 20th) is before ICLR results come out (Jan 24th). Does anyone know if this was done purposefully to prevent resubmissions of rejected papers from ICLR? FWIW I don't think it's a bad idea, just curious if there was some public discussion about this. For context, last year the ICML abstract deadline was 2 weeks after ICLR results came out. submitted by /u/lit_turtle_man [link] [comments]
    [R] Does anyone have any experience renting out a whole Nvidia A100 bare metal? Who did you go through? What was your experience like?
    (1 min) Just like the title says, looking to hear from anyone who has rented a bare metal A100 before and wondering how it went and why you decided to go with bare metal - thanks! submitted by /u/After-Surfree [link] [comments]
    [R] Accelerating ML-based analytics queries with indexes
    (1 min) Running queries over unstructured data? Our new work on indexes, TASTI, can accelerate your queries by up to 20x! We describe our work in part 5 of our blog post series on accelerating queries over unstructured data and has been accepted to SIGMOD 2022: https://dawn.cs.stanford.edu/2022/01/18/tasti/ (full paper: https://arxiv.org/abs/2009.04540) TASTI does this by leveraging redundancy in datasets: it labels a small amount of data and learns distances from unlabeled data points to the labeled ones. It can then generate proxy scores per-query without training a new model! See our blog post for full details: https://dawn.cs.stanford.edu/2022/01/18/tasti/ submitted by /u/danieldkang [link] [comments]
    [D] Did you also feel that Snorkel's LabelModel is really slow?
    (0 min) Has anyone here used Snorkel AI's LabelModel for automatically labeling text? Have you found it to be super slow? submitted by /u/freaky_eater [link] [comments]
    [D] Predict video frames using Generative Adversarial Networks (GANs)
    (1 min) So, I have a project where it is desirable to predict time sequence of images (like frames of a video) as output of a neural network starting from an another time sequence of different images. From what I understood, I cannot use VAE as the input images and output images are representation of different things (like cats and dogs). I was investigating the use of GAN for this purposes. However, I never implemented something like this before (and I read training GANs can be very painful and unstable). Checking some works , I have seen this is possible in practice starting from past frames as inputs. However, this is not my case. Do you have any suggestion on how to breakdown the problem? submitted by /u/Betelgeuse999 [link] [comments]
    [D] First Author Interview - Noether Networks: Meta-Learning Useful Conserved Quantities (Paper Explained Video & Interview)
    (1 min) https://youtu.be/Xp3jR-ttMfo This video includes an interview with first author Ferran Alet! Encoding inductive biases has been a long established methods to provide deep networks with the ability to learn from less data. Especially useful are encodings of symmetry properties of the data, such as the convolution's translation invariance. But such symmetries are often hard to program explicitly, and can only be encoded exactly when done in a direct fashion. Noether Networks use Noether's theorem connecting symmetries to conserved quantities and are able to dynamically and approximately enforce symmetry properties upon deep neural networks. ​ OUTLINE: 0:00 - Intro & Overview 18:10 - Interview Start 21:20 - Symmetry priors vs conserved quantities 23:25 - Example: Pendulum 27:45 - Noether Network Model Overview 35:35 - Optimizing the Noether Loss 41:00 - Is the computation graph stable? 46:30 - Increasing the inference time computation 48:45 - Why dynamically modify the model? 55:30 - Experimental Results & Discussion ​ Paper: https://arxiv.org/abs/2112.03321 Website: https://dylandoblar.github.io/noether-networks/ Code: https://github.com/dylandoblar/noether-networks submitted by /u/ykilcher [link] [comments]
    [D] Generalization in time-series data and the underlying assumptions of ML models.
    (1 min) I've been refreshing some of my groundwork theoretical knowledge on statistics and generalization; So basically all "classical" statistical models assume that the data the model would be predicting comes from the same distribution as the data the model was trained on. This in turn gives us concrete figures on the generalization of our model. But what about time-series? Time-series datasets are inherently not drawn from the same distribution. So I suppose this means I can't have figures on generalization? How about applying models that assume a certain distribution of the training dataset (like Linear Regression)? submitted by /u/bloodmummy [link] [comments]
    [P] Can't finish my master's thesis. What to do?
    (10 min) I am not sure if that is the place to do this post. But there it goes. I am on the last semester of my Ma degree in AI. I am still working on some courses but I will probably have no problem finishing them. So I ll be left only with the thesis to work on. I have less than one month to finish, but it is possible to ask for an extension of 3 months.An important thing is that I am doing that in cooperation with a startup. Because of that my 'real' advisor is my boss at this startup. I have exchanged only a couple of emails with my advisor from university. The problem is that I am stuck in my project. It is about generating synthetic data for training neural networks that recognize numbers in the back of soccer players. But I feel that I am wasting time in a proposal that will never work, be…
    [D]Where to find current research on rating systems
    (1 min) Hello /r/MachineLearning, I have a strong interest in learning more about the state of the art for rating systems. I know about simple ones such as Elo, and Bradley Terry models, as well as some more advanced ones like Glicko, TrueSkill, WHR, and Melo, as well as some types of learning to rank models that can be applied to rating. I was wondering if there is some terminology for this type of thing or any central location where research in this area is published like a certain conference or journal. I have found what I know so far among stats journals usually mentioning "paired comparisons", some at ML conferences like Neurips and some just on arxiv or online. My best resource so far to find new works in this area is to go on google scholar and try to find the most recent papers which cite TrueSkill or Glicko. Does anyone here know a good way to find the most recent research in this niche area? Thanks! submitted by /u/cthorrez [link] [comments]
    [P] I made a software package for the N-Beats neural network architecture in keras
    (1 min) Hi everyone! I work in time series and forecasting a lot at work, and recently developed a user-friendly version of the N-Beats architecture with keras, since I didn't think there were good choices here. You can see the documentation here: https://kerasbeats.readthedocs.io/en/latest/ The package provides a high level wrapper over a keras model, but also gives you the ability to work directly with the model layer if you so please. I don't know that this is going to set the world of DL on fire, but I've been using it at work already with good results, and it provides an easy way to configure the model architecture defined in the paper here: https://arxiv.org/abs/1905.10437. submitted by /u/jonathanbechtel [link] [comments]
    [Discussion] Good software to review annotations from a CSV file ;
    (1 min) Hi! I am annotating pictures at VGG VIA which works great. But it is difficult to double check my annotations. In other softwares there are color codes etc to make it easier, but in VIA there is not. ‏‏‎ Is there a software to annotate that i could upload my work (CSV) from VIA to do review more easily? (I have 18,000 pictures to review) Thanks! submitted by /u/yellowgorilla556 [link] [comments]
  • Neural Networks, Deep Learning and Machine Learning

    Instant Neural Graphics Primitives with a Multiresolution Hash Encoding
    (0 min) submitted by /u/nickb [link] [comments]
  • Reinforcement Learning

    When should discretization of observations be considered?
    (1 min) I saw some paper that discuss the shape the action space in various environments should have. However, I haven’t seen that much regarding observations. I’m not talking about normalisation of observations, as NNs are mostly used as function approximators in reinforcement learning. I was curious about this because a policy is basically a mapping of actions to states and discretizing the observations will eventually lead to less mapping that has to be learned. Of course this could lead to lower peak performance, but I can also imagine higher training speed. Or is my understanding of it wrong? Does anyone have a reference concerning this topic? submitted by /u/kitaird [link] [comments]
    Soft Actor-Critic: termination step
    (1 min) Hi all! I'm implementing the SAC algorithm for a custom environment. The agent controls a creature that is not supposed to die or reach any final step. And this affects how I train Q-networks: there is no state where the target is just a reward without transition: y = r + γ*(1 - done)*(min(q1Targ(s', a'), q2Targ(s', a')) - α*log(π(s')) here "done" is always 0. Will my policy converge in this case? submitted by /u/CharlottHebb [link] [comments]
    How Amazon Ads uses reinforcement learning
    (0 min) submitted by /u/ordinarytime [link] [comments]
    [newbie] How to make a reward function to solve a toy bit flipping problem?
    (2 min) I am new to RL and am trying the following toy problem using [stable baselines 3][1]. We start with a random array of bits (that is values that are either 1 or 0). The only actions are to flip one of the bits and the goal is to set them all to 1. If I make the input list of length 10 then it works fine using the following simple modification of the tutorial code: ​ import numpy as np import gym from gym import spaces import random class BitFlipEnv(gym.Env): """ Custom Environment that follows gym interface. This is a simple env where the agent must learn to go always left. """ ​ metadata = {'render.modes': ['console']} def __init__(self, array_length=10): super(BitFlipEnv, self).__init__() # Size of the 1D-grid self.array_length = array_length self.agent_pos = random.choic…
    [D] DDPG not converging were actor critic and DQN are - any advice?
    (2 min) Hi all, I am following the keras DDPG workflow https://keras.io/examples/rl/ddpg_pendulum/ for a custom environment I have built which is a graph optimisation problem in which a removal or addition of connection between nodes (as done by the agent) results in a new state. the reward system is based on how beneficial node removal was as measured by a metric. The state is represented by a node embedding (resulting from a node2vec representation of that node at each step) or a CNN/GNN of the new graph at each step (i am experimenting with both)... I have used a smaller version of the graph network (100 nodes) as a test for the exercise - both DQN and actor critic methods converge quickly. In my attempt to scale up the problem to a larger graph, I have changed this to a continuous task and chosen action selection as two dials with a range between 0-n nodes (each dial selecting a node after which the connection will be changed between those nodes) - I am still doing this on the smaller graph just to see that it works. unfortunately, I am having issues where there is no convergence! even though it has seen positive rewards several times - while vague, I was wondering if there are any glaring obvious things that you might advise for why this be happening and what should be tweaked/changed - A couple observations - The agent goes to the extrema of action selection practically immediately (-1,1) (note tanh is untweaked as I simply scale up the actions inside the environment, but it still returns the actions in range -1,1) - replay buffer contains actions in -1,1 range and rewards are in range -1,1. One thing i am unsure about is how much standard deviation i should for noise - any advise is much appreciated, I am happy to share snippets of code if anyone thinks there are particular parts that might be relevant - thank you! submitted by /u/amjass12 [link] [comments]

2022-01-18

  • cs.LG updates on arXiv.org

    Artificial Intelligence in Software Testing : Impact, Problems, Challenges and Prospect. (arXiv:2201.05371v1 [cs.SE])
    (2 min) Artificial Intelligence (AI) is making a significant impact in multiple areas like medical, military, industrial, domestic, law, arts as AI is capable to perform several roles such as managing smart factories, driving autonomous vehicles, creating accurate weather forecasts, detecting cancer and personal assistants, etc. Software testing is the process of putting the software to test for some abnormal behaviour of the software. Software testing is a tedious, laborious and most time-consuming process. Automation tools have been developed that help to automate some activities of the testing process to enhance quality and timely delivery. Over time with the inclusion of continuous integration and continuous delivery (CI/CD) pipeline, automation tools are becoming less effective. The testing community is turning to AI to fill the gap as AI is able to check the code for bugs and errors without any human intervention and in a much faster way than humans. In this study, we aim to recognize the impact of AI technologies on various software testing activities or facets in the STLC. Further, the study aims to recognize and explain some of the biggest challenges software testers face while applying AI to testing. The paper also proposes some key contributions of AI in the future to the domain of software testing.
    Machine Learning of polymer types from the spectral signature of Raman spectroscopy microplastics data. (arXiv:2201.05445v1 [cs.LG])
    (2 min) The tools and technology that are currently used to analyze chemical compound structures that identify polymer types in microplastics are not well-calibrated for environmentally weathered microplastics. Microplastics that have been degraded by environmental weathering factors can offer less analytic certainty than samples of microplastics that have not been exposed to weathering processes. Machine learning tools and techniques allow us to better calibrate the research tools for certainty in microplastics analysis. In this paper, we investigate whether the signatures (Raman shift values) are distinct enough such that well studied machine learning (ML) algorithms can learn to identify polymer types using a relatively small amount of labeled input data when the samples have not been impacted by environmental degradation. Several ML models were trained on a well-known repository, Spectral Libraries of Plastic Particles (SLOPP), that contain Raman shift and intensity results for a range of plastic particles, then tested on environmentally aged plastic particles (SloPP-E) consisting of 22 polymer types. After extensive preprocessing and augmentation, the trained random forest model was then tested on the SloPP-E dataset resulting in an improvement in classification accuracy of 93.81% from 89%.
    Decompositional Quantum Graph Neural Network. (arXiv:2201.05158v1 [quant-ph])
    (2 min) Quantum machine learning is a fast emerging field that aims to tackle machine learning using quantum algorithms and quantum computing. Due to the lack of physical qubits and an effective means to map real-world data from Euclidean space to Hilbert space, most of these methods focus on quantum analogies or process simulations rather than devising concrete architectures based on qubits. In this paper, we propose a novel hybrid quantum-classical algorithm for graph-structured data, which we refer to as the Decompositional Quantum Graph Neural Network (DQGNN). DQGNN implements the GNN theoretical framework using the tensor product and unity matrices representation, which greatly reduces the number of model parameters required. When controlled by a classical computer, DQGNN can accommodate arbitrarily sized graphs by processing substructures from the input graph using a modestly-sized quantum device. The architecture is based on a novel mapping from real-world data to Hilbert space. This mapping maintains the distance relations present in the data and reduces information loss. Experimental results show that the proposed method outperforms competitive state-of-the-art models with only 1.68\% parameters compared to those models.
    Making a (Counterfactual) Difference One Rationale at a Time. (arXiv:2201.05177v1 [cs.CL])
    (2 min) Rationales, snippets of extracted text that explain an inference, have emerged as a popular framework for interpretable natural language processing (NLP). Rationale models typically consist of two cooperating modules: a selector and a classifier with the goal of maximizing the mutual information (MMI) between the "selected" text and the document label. Despite their promises, MMI-based methods often pick up on spurious text patterns and result in models with nonsensical behaviors. In this work, we investigate whether counterfactual data augmentation (CDA), without human assistance, can improve the performance of the selector by lowering the mutual information between spurious signals and the document label. Our counterfactuals are produced in an unsupervised fashion using class-dependent generative models. From an information theoretic lens, we derive properties of the unaugmented dataset for which our CDA approach would succeed. The effectiveness of CDA is empirically evaluated by comparing against several baselines including an improved MMI-based rationale schema on two multi aspect datasets. Our results show that CDA produces rationales that better capture the signal of interest.
    A Deep Generative Model for Molecule Optimization via One Fragment Modification. (arXiv:2012.04231v5 [cs.LG] UPDATED)
    (2 min) Molecule optimization is a critical step in drug development to improve desired properties of drug candidates through chemical modification. We developed a novel deep generative model Modof over molecular graphs for molecule optimization. Modof modifies a given molecule through the prediction of a single site of disconnection at the molecule and the removal and/or addition of fragments at that site. A pipeline of multiple, identical Modof models is implemented into Modof-pipe to modify an input molecule at multiple disconnection sites. Here we show that Modof-pipe is able to retain major molecular scaffolds, allow controls over intermediate optimization steps and better constrain molecule similarities. Modof-pipe outperforms the state-of-the-art methods on benchmark datasets: without molecular similarity constraints, Modof-pipe achieves 81.2% improvement in octanol-water partition coefficient penalized by synthetic accessibility and ring size; and 51.2%, 25.6% and 9.2% improvement if the optimized molecules are at least 0.2, 0.4 and 0.6 similar to those before optimization, respectively. Modof-pipe is further enhanced into Modof-pipem to allow modifying one molecule to multiple optimized ones. Modof-pipem achieves additional performance improvement as at least 17.8% better than Modof-pipe.
    Drop Clause: Enhancing Performance, Interpretability and Robustness of the Tsetlin Machine. (arXiv:2105.14506v2 [cs.LG] UPDATED)
    (2 min) In this article, we introduce a novel variant of the Tsetlin machine (TM) that randomly drops clauses, the key learning elements of a TM. In effect, TM with drop clause ignores a random selection of the clauses in each epoch, selected according to a predefined probability. In this way, additional stochasticity is introduced in the learning phase of TM. To explore the effects drop clause has on accuracy, training time, interpretability and robustness, we conduct extensive experiments on nine benchmark datasets in natural language processing~(NLP) (IMDb, R8, R52, MR and TREC) and image classification (MNIST, Fashion MNIST, CIFAR-10 and CIFAR-100). Our proposed model outperforms baseline machine learning algorithms by a wide margin and achieves competitive performance in comparison with recent deep learning model such as BERT and AlexNET-DFA. In brief, we observe up to +10% increase in accuracy and 2x to 4x faster learning compared with standard TM. We further employ the Convolutional TM to document interpretable results on the CIFAR datasets, visualizing how the heatmaps produced by the TM become more interpretable with drop clause. We also evaluate how drop clause affects learning robustness by introducing corruptions and alterations in the image/language test data. Our results show that drop clause makes TM more robust towards such changes.
    Understanding Bird's-Eye View of Road Semantics using an Onboard Camera. (arXiv:2012.03040v2 [cs.CV] UPDATED)
    (2 min) Autonomous navigation requires scene understanding of the action-space to move or anticipate events. For planner agents moving on the ground plane, such as autonomous vehicles, this translates to scene understanding in the bird's-eye view (BEV). However, the onboard cameras of autonomous cars are customarily mounted horizontally for a better view of the surrounding. In this work, we study scene understanding in the form of online estimation of semantic BEV maps using the video input from a single onboard camera. We study three key aspects of this task, image-level understanding, BEV level understanding, and the aggregation of temporal information. Based on these three pillars we propose a novel architecture that combines these three aspects. In our extensive experiments, we demonstrate that the considered aspects are complementary to each other for BEV understanding. Furthermore, the proposed architecture significantly surpasses the current state-of-the-art. Code: https://github.com/ybarancan/BEV_feat_stitch.
    Contextual Bandits for Advertising Campaigns: A Diffusion-Model Independent Approach (Extended Version). (arXiv:2201.05231v1 [cs.LG])
    (2 min) Motivated by scenarios of information diffusion and advertising in social media, we study an influence maximization problem in which little is assumed to be known about the diffusion network or about the model that determines how information may propagate. In such a highly uncertain environment, one can focus on multi-round diffusion campaigns, with the objective to maximize the number of distinct users that are influenced or activated, starting from a known base of few influential nodes. During a campaign, spread seeds are selected sequentially at consecutive rounds, and feedback is collected in the form of the activated nodes at each round. A round's impact (reward) is then quantified as the number of newly activated nodes. Overall, one must maximize the campaign's total spread, as the sum of rounds' rewards. In this setting, an explore-exploit approach could be used to learn the key underlying diffusion parameters, while running the campaign. We describe and compare two methods of contextual multi-armed bandits, with upper-confidence bounds on the remaining potential of influencers, one using a generalized linear model and the Good-Turing estimator for remaining potential (GLM-GT-UCB), and another one that directly adapts the LinUCB algorithm to our setting (LogNorm-LinUCB). We show that they outperform baseline methods using state-of-the-art ideas, on synthetic and real-world data, while at the same time exhibiting different and complementary behavior, depending on the scenarios in which they are deployed.
    Understanding Negative Samples in Instance Discriminative Self-supervised Representation Learning. (arXiv:2102.06866v5 [cs.LG] UPDATED)
    (2 min) Instance discriminative self-supervised representation learning has been attracted attention thanks to its unsupervised nature and informative feature representation for downstream tasks. In practice, it commonly uses a larger number of negative samples than the number of supervised classes. However, there is an inconsistency in the existing analysis; theoretically, a large number of negative samples degrade classification performance on a downstream supervised task, while empirically, they improve the performance. We provide a novel framework to analyze this empirical result regarding negative samples using the coupon collector's problem. Our bound can implicitly incorporate the supervised loss of the downstream task in the self-supervised loss by increasing the number of negative samples. We confirm that our proposed analysis holds on real-world benchmark datasets.
    Proceedings of the 4th Workshop on Online Recommender Systems and User Modeling -- ORSUM 2021. (arXiv:2201.05156v1 [cs.IR])
    (2 min) Modern online services continuously generate data at very fast rates. This continuous flow of data encompasses content -- e.g., posts, news, products, comments --, but also user feedback -- e.g., ratings, views, reads, clicks --, together with context data -- user device, spatial or temporal data, user task or activity, weather. This can be overwhelming for systems and algorithms designed to train in batches, given the continuous and potentially fast change of content, context and user preferences or intents. Therefore, it is important to investigate online methods able to transparently adapt to the inherent dynamics of online services. Incremental models that learn from data streams are gaining attention in the recommender systems community, given their natural ability to deal with the continuous flows of data generated in dynamic, complex environments. User modeling and personalization can particularly benefit from algorithms capable of maintaining models incrementally and online. The objective of this workshop is to foster contributions and bring together a growing community of researchers and practitioners interested in online, adaptive approaches to user modeling, recommendation and personalization, and their implications regarding multiple dimensions, such as evaluation, reproducibility, privacy and explainability.
    Multi-Domain Active Learning: A Comparative Study. (arXiv:2106.13516v4 [cs.LG] UPDATED)
    (2 min) Multi-domain learning (MDL) refers to learning a set of models simultaneously, with each one specialized to perform a task in a certain domain. Generally, high labeling effort is required in MDL, as data needs to be labeled by human experts for every domain. To address the above issue, Active learning (AL) can be utilized to reduce the labeling effort by only using the most informative data. The resultant paradigm is termed multi-domain active learning (MDAL). However, despite the practical significance of MDAL, there exists little research on it, not to mention any off-the-shelf solution. To fill this gap, we construct a simple pipeline of MDAL, and present a comprehensive comparative study of 30 different MDAL algorithms, which are established by combining 6 representative MDL models (equipped with various information-sharing schemes) and 5 well-used AL strategies. We evaluate the algorithms on 6 datasets, involving textual and visual classification tasks. In most cases, AL brings notable improvements to MDL, and surprisingly, the naive best vs second best (BvSB) uncertainty strategy could perform competitively to the state-of-the-art AL strategies. Besides, among the MDL models, MAN and SDL-joint achieve the top performance when applied to vector features and raw images, respectively. Furthermore, we qualitatively analyze the behaviors of these strategies and models, shedding light on their superior performance in the comparison. Overall, some guidelines are provided, which could help to choose MDL models and AL strategies for particular applications.
    Deep Learning for Agile Effort Estimation Have We Solved the Problem Yet?. (arXiv:2201.05401v1 [cs.SE])
    (2 min) In the last decade, several studies have proposed the use of automated techniques to estimate the effort of agile software development. In this paper we perform a close replication and extension of a seminal work proposing the use of Deep Learning for agile effort estimation (namely Deep-SE), which has set the state-of-the-art since. Specifically, we replicate three of the original research questions aiming at investigating the effectiveness of Deep-SE for both within-project and cross-project effort estimation. We benchmark Deep-SE against three baseline techniques (i.e., Random, Mean and Median effort prediction) and a previously proposed method to estimate agile software project development effort (dubbed TF/IDF-SE), as done in the original study. To this end, we use both the data from the original study and a new larger dataset of 31,960 issues, which we mined from 29 open-source projects. Using more data allows us to strengthen our confidence in the results and further mitigate the threat to the external validity of the study. We also extend the original study by investigating two additional research questions. One evaluates the accuracy of Deep-SE when the training set is augmented with issues from all other projects available in the repository at the time of estimation, and the other examines whether an expensive pre-training step used by the original Deep-SE, has any beneficial effect on its accuracy and convergence speed. The results of our replication show that Deep-SE outperforms the Median baseline estimator and TF/IDF-SE in only very few cases with statistical significance (8/42 and 9/32 cases, respectively), thus confounding previous findings on the efficacy of Deep-SE. The two additional RQs revealed that neither augmenting the training set nor pre-training Deep-SE play a role in improving its accuracy and convergence speed. ...
    CommonsenseQA 2.0: Exposing the Limits of AI through Gamification. (arXiv:2201.05320v1 [cs.CL])
    (2 min) Constructing benchmarks that test the abilities of modern natural language understanding models is difficult - pre-trained language models exploit artifacts in benchmarks to achieve human parity, but still fail on adversarial examples and make errors that demonstrate a lack of common sense. In this work, we propose gamification as a framework for data construction. The goal of players in the game is to compose questions that mislead a rival AI while using specific phrases for extra points. The game environment leads to enhanced user engagement and simultaneously gives the game designer control over the collected data, allowing us to collect high-quality data at scale. Using our method we create CommonsenseQA 2.0, which includes 14,343 yes/no questions, and demonstrate its difficulty for models that are orders-of-magnitude larger than the AI used in the game itself. Our best baseline, the T5-based Unicorn with 11B parameters achieves an accuracy of 70.2%, substantially higher than GPT-3 (52.9%) in a few-shot inference setup. Both score well below human performance which is at 94.1%.
    Data Science Methodologies: Current Challenges and Future Approaches. (arXiv:2106.07287v2 [cs.LG] UPDATED)
    (2 min) Data science has employed great research efforts in developing advanced analytics, improving data models and cultivating new algorithms. However, not many authors have come across the organizational and socio-technical challenges that arise when executing a data science project: lack of vision and clear objectives, a biased emphasis on technical issues, a low level of maturity for ad-hoc projects and the ambiguity of roles in data science are among these challenges. Few methodologies have been proposed on the literature that tackle these type of challenges, some of them date back to the mid-1990, and consequently they are not updated to the current paradigm and the latest developments in big data and machine learning technologies. In addition, fewer methodologies offer a complete guideline across team, project and data & information management. In this article we would like to explore the necessity of developing a more holistic approach for carrying out data science projects. We first review methodologies that have been presented on the literature to work on data science projects and classify them according to the their focus: project, team, data and information management. Finally, we propose a conceptual framework containing general characteristics that a methodology for managing data science projects with a holistic point of view should have. This framework can be used by other researchers as a roadmap for the design of new data science methodologies or the updating of existing ones.
    Towards Principled Disentanglement for Domain Generalization. (arXiv:2111.13839v2 [cs.LG] UPDATED)
    (2 min) A fundamental challenge for machine learning models is generalizing to out-of-distribution (OOD) data, in part due to spurious correlations. To tackle this challenge, we first formalize the OOD generalization problem as constrained optimization, called Disentanglement-constrained Domain Generalization (DDG). We relax this non-trivial constrained optimization to a tractable form with finite-dimensional parameterization and empirical approximation. Then a theoretical analysis of the extent to which the above transformations deviates from the original problem is provided. Based on the transformation, we propose a primal-dual algorithm for joint representation disentanglement and domain generalization. In contrast to traditional approaches based on domain adversarial training and domain labels, DDG jointly learns semantic and variation encoders for disentanglement, enabling flexible manipulation and augmentation on training data. DDG aims to learn intrinsic representations of semantic concepts that are invariant to nuisance factors and generalizable across different domains. Comprehensive experiments on popular benchmarks show that DDG can achieve competitive OOD performance and uncover interpretable salient structures within data.
    `Next Generation' Reservoir Computing: an Empirical Data-Driven Expression of Dynamical Equations in Time-Stepping Form. (arXiv:2201.05193v1 [cs.LG])
    (2 min) Next generation reservoir computing based on nonlinear vector autoregression (NVAR) is applied to emulate simple dynamical system models and compared to numerical integration schemes such as Euler and the $2^\text{nd}$ order Runge-Kutta. It is shown that the NVAR emulator can be interpreted as a data-driven method used to recover the numerical integration scheme that produced the data. It is also shown that the approach can be extended to produce high-order numerical schemes directly from data. The impacts of the presence of noise and temporal sparsity in the training set is further examined to gauge the potential use of this method for more realistic applications.
    Jamming Attacks on Federated Learning in Wireless Networks. (arXiv:2201.05172v1 [cs.LG])
    (2 min) Federated learning (FL) offers a decentralized learning environment so that a group of clients can collaborate to train a global model at the server, while keeping their training data confidential. This paper studies how to launch over-the-air jamming attacks to disrupt the FL process when it is executed over a wireless network. As a wireless example, FL is applied to learn how to classify wireless signals collected by clients (spectrum sensors) at different locations (such as in cooperative sensing). An adversary can jam the transmissions for the local model updates from clients to the server (uplink attack), or the transmissions for the global model updates the server to clients (downlink attack), or both. Given a budget imposed on the number of clients that can be attacked per FL round, clients for the (uplink/downlink) attack are selected according to their local model accuracies that would be expected without an attack or ranked via spectrum observations. This novel attack is extended to general settings by accounting different processing speeds and attack success probabilities for clients. Compared to benchmark attack schemes, this attack approach degrades the FL performance significantly, thereby revealing new vulnerabilities of FL to jamming attacks in wireless networks.
    Synthesising Electronic Health Records: Cystic Fibrosis Patient Group. (arXiv:2201.05400v1 [cs.LG])
    (2 min) Class imbalance can often degrade predictive performance of supervised learning algorithms. Balanced classes can be obtained by oversampling exact copies, with noise, or interpolation between nearest neighbours (as in traditional SMOTE methods). Oversampling tabular data using augmentation, as is typical in computer vision tasks, can be achieved with deep generative models. Deep generative models are effective data synthesisers due to their ability to capture complex underlying distributions. Synthetic data in healthcare can enhance interoperability between healthcare providers by ensuring patient privacy. Equipped with large synthetic datasets which do well to represent small patient groups, machine learning in healthcare can address the current challenges of bias and generalisability. This paper evaluates synthetic data generators ability to synthesise patient electronic health records. We test the utility of synthetic data for patient outcome classification, observing increased predictive performance when augmenting imbalanced datasets with synthetic data.
    When less is more: Simplifying inputs aids neural network understanding. (arXiv:2201.05610v1 [cs.LG])
    (2 min) How do neural network image classifiers respond to simpler and simpler inputs? And what do such responses reveal about the learning process? To answer these questions, we need a clear measure of input simplicity (or inversely, complexity), an optimization objective that correlates with simplification, and a framework to incorporate such objective into training and inference. Lastly we need a variety of testbeds to experiment and evaluate the impact of such simplification on learning. In this work, we measure simplicity with the encoding bit size given by a pretrained generative model, and minimize the bit size to simplify inputs in training and inference. We investigate the effect of such simplification in several scenarios: conventional training, dataset condensation and post-hoc explanations. In all settings, inputs are simplified along with the original classification task, and we investigate the trade-off between input simplicity and task performance. For images with injected distractors, such simplification naturally removes superfluous information. For dataset condensation, we find that inputs can be simplified with almost no accuracy degradation. When used in post-hoc explanation, our learning-based simplification approach offers a valuable new tool to explore the basis of network decisions.
    Probabilistic spatial clustering based on the Self Discipline Learning (SDL) model of autonomous learning. (arXiv:2201.03449v2 [cs.LG] UPDATED)
    (2 min) Unsupervised clustering algorithm can effectively reduce the dimension of high-dimensional unlabeled data, thus reducing the time and space complexity of data processing. However, the traditional clustering algorithm needs to set the upper bound of the number of categories in advance, and the deep learning clustering algorithm will fall into the problem of local optimum. In order to solve these problems, a probabilistic spatial clustering algorithm based on the Self Discipline Learning(SDL) model is proposed. The algorithm is based on the Gaussian probability distribution of the probability space distance between vectors, and uses the probability scale and maximum probability value of the probability space distance as the distance measurement judgment, and then determines the category of each sample according to the distribution characteristics of the data set itself. The algorithm is tested in Laboratory for Intelligent and Safe Automobiles(LISA) traffic light data set, the accuracy rate is 99.03%, the recall rate is 91%, and the effect is achieved.
    AuGPT: Auxiliary Tasks and Data Augmentation for End-To-End Dialogue with Pre-Trained Language Models. (arXiv:2102.05126v3 [cs.CL] UPDATED)
    (2 min) Attention-based pre-trained language models such as GPT-2 brought considerable progress to end-to-end dialogue modelling. However, they also present considerable risks for task-oriented dialogue, such as lack of knowledge grounding or diversity. To address these issues, we introduce modified training objectives for language model finetuning, and we employ massive data augmentation via back-translation to increase the diversity of the training data. We further examine the possibilities of combining data from multiples sources to improve performance on the target dataset. We carefully evaluate our contributions with both human and automatic methods. Our model substantially outperforms the baseline on the MultiWOZ data and shows competitive performance with state of the art in both automatic and human evaluation.
    Communication-Efficient ADMM-based Federated Learning. (arXiv:2110.15318v2 [cs.LG] UPDATED)
    (2 min) Federated learning has shown its advances over the last few years but is facing many challenges, such as how algorithms save communication resources, how they reduce computational costs, and whether they converge. To address these issues, this paper proposes exact and inexact ADMM-based federated learning. They are not only communication-efficient but also converge linearly under very mild conditions, such as convexity-free and irrelevance to data distributions. Moreover, the inexact version has low computational complexity, thereby alleviating the computational burdens significantly.
    Segmentation of cell-level anomalies in electroluminescence images of photovoltaic modules. (arXiv:2106.10962v2 [cs.CV] UPDATED)
    (2 min) In the operation & maintenance (O&M) of photovoltaic (PV) plants, the early identification of failures has become crucial to maintain productivity and prolong components' life. Of all defects, cell-level anomalies can lead to serious failures and may affect surrounding PV modules in the long run. These fine defects are usually captured with high spatial resolution electroluminescence (EL) imaging. The difficulty of acquiring such images has limited the availability of data. For this work, multiple data resources and augmentation techniques have been used to surpass this limitation. Current state-of-the-art detection methods extract barely low-level information from individual PV cell images, and their performance is conditioned by the available training data. In this article, we propose an end-to-end deep learning pipeline that detects, locates and segments cell-level anomalies from entire photovoltaic modules via EL images. The proposed modular pipeline combines three deep learning techniques: 1. object detection (modified Faster-RNN), 2. image classification (EfficientNet) and 3. weakly supervised segmentation (autoencoder). The modular nature of the pipeline allows to upgrade the deep learning models to the further improvements in the state-of-the-art and also extend the pipeline towards new functionalities.
    SignalNet: A Low Resolution Sinusoid Decomposition and Estimation Network. (arXiv:2106.05490v2 [eess.SP] UPDATED)
    (2 min) The detection and estimation of sinusoids is a fundamental signal processing task for many applications related to sensing and communications. While algorithms have been proposed for this setting, quantization is a critical, but often ignored modeling effect. In wireless communications, estimation with low resolution data converters is relevant for reduced power consumption in wideband receivers. Similarly, low resolution sampling in imaging and spectrum sensing allows for efficient data collection. In this work, we propose SignalNet, a neural network architecture that detects the number of sinusoids and estimates their parameters from quantized in-phase and quadrature samples. We incorporate signal reconstruction internally as domain knowledge within the network to enhance learning and surpass traditional algorithms in mean squared error and Chamfer error. We introduce a worst-case learning threshold for comparing the results of our network relative to the underlying data distributions. This threshold provides insight into why neural networks tend to outperform traditional methods and into the learned relationships between the input and output distributions. In simulation, we find that our algorithm is always able to surpass the threshold for three-bit data but often cannot exceed the threshold for one-bit data. We use the learning threshold to explain, in the one-bit case, how our estimators learn to minimize the distributional loss, rather than learn features from the data.
    Neural Circuit Architectural Priors for Embodied Control. (arXiv:2201.05242v1 [cs.LG])
    (2 min) Artificial neural networks for simulated motor control and robotics often adopt generic architectures like fully connected MLPs. While general, these tabula rasa architectures rely on large amounts of experience to learn, are not easily transferable to new bodies, and have internal dynamics that are difficult to interpret. In nature, animals are born with highly structured connectivity in their nervous systems shaped by evolution; this innate circuitry acts synergistically with learning mechanisms to provide inductive biases that enable most animals to function well soon after birth and improve abilities efficiently. Convolutional networks inspired by visual circuitry have encoded useful biases for vision. However, it is unknown the extent to which ANN architectures inspired by neural circuitry can yield useful biases for other domains. In this work, we ask what advantages biologically inspired network architecture can provide in the context of motor control. Specifically, we translate C. elegans circuits for locomotion into an ANN model controlling a simulated Swimmer agent. On a locomotion task, our architecture achieves good initial performance and asymptotic performance comparable with MLPs, while dramatically improving data efficiency and requiring orders of magnitude fewer parameters. Our architecture is more interpretable and transfers to new body designs. An ablation analysis shows that principled excitation/inhibition is crucial for learning, while weight initialization contributes to good initial performance. Our work demonstrates several advantages of ANN architectures inspired by systems neuroscience and suggests a path towards modeling more complex behavior.
    A novel notion of barycenter for probability distributions based on optimal weak mass transport. (arXiv:2102.13380v3 [stat.ML] UPDATED)
    (2 min) We introduce weak barycenters of a family of probability distributions, based on the recently developed notion of optimal weak transport of mass by Gozlanet al. (2017) and Backhoff-Veraguas et al. (2020). We provide a theoretical analysis of this object and discuss its interpretation in the light of convex ordering between probability measures. In particular, we show that, rather than averaging the input distributions in a geometric way (as the Wasserstein barycenter based on classic optimal transport does) weak barycenters extract common geometric information shared by all the input distributions, encoded as a latent random variable that underlies all of them. We also provide an iterative algorithm to compute a weak barycenter for a finite family of input distributions, and a stochastic algorithm that computes them for arbitrary populations of laws. The latter approach is particularly well suited for the streaming setting, i.e., when distributions are observed sequentially. The notion of weak barycenter and our approaches to compute it are illustrated on synthetic examples, validated on 2D real-world data and compared to standard Wasserstein barycenters.
    Reproducible, incremental representation learning with Rosetta VAE. (arXiv:2201.05206v1 [cs.LG])
    (2 min) Variational autoencoders are among the most popular methods for distilling low-dimensional structure from high-dimensional data, making them increasingly valuable as tools for data exploration and scientific discovery. However, unlike typical machine learning problems in which a single model is trained once on a single large dataset, scientific workflows privilege learned features that are reproducible, portable across labs, and capable of incrementally adding new data. Ideally, methods used by different research groups should produce comparable results, even without sharing fully trained models or entire data sets. Here, we address this challenge by introducing the Rosetta VAE (R-VAE), a method of distilling previously learned representations and retraining new models to reproduce and build on prior results. The R-VAE uses post hoc clustering over the latent space of a fully-trained model to identify a small number of Rosetta Points (input, latent pairs) to serve as anchors for training future models. An adjustable hyperparameter, $\rho$, balances fidelity to the previously learned latent space against accommodation of new data. We demonstrate that the R-VAE reconstructs data as well as the VAE and $\beta$-VAE, outperforms both methods in recovery of a target latent space in a sequential training setting, and dramatically increases consistency of the learned representation across training runs.
    Precise Stock Price Prediction for Robust Portfolio Design from Selected Sectors of the Indian Stock Market. (arXiv:2201.05570v1 [q-fin.PM])
    (0 min) Stock price prediction is a challenging task and a lot of propositions exist in the literature in this area. Portfolio construction is a process of choosing a group of stocks and investing in them optimally to maximize the return while minimizing the risk. Since the time when Markowitz proposed the Modern Portfolio Theory, several advancements have happened in the area of building efficient portfolios. An investor can get the best benefit out of the stock market if the investor invests in an efficient portfolio and could take the buy or sell decision in advance, by estimating the future asset value of the portfolio with a high level of precision. In this project, we have built an efficient portfolio and to predict the future asset value by means of individual stock price prediction of the stocks in the portfolio. As part of building an efficient portfolio we have studied multiple portfolio optimization methods beginning with the Modern Portfolio theory. We have built the minimum variance portfolio and optimal risk portfolio for all the five chosen sectors by using past daily stock prices over the past five years as the training data, and have also conducted back testing to check the performance of the portfolio. A comparative study of minimum variance portfolio and optimal risk portfolio with equal weight portfolio is done by backtesting.
    IDEA: Interpretable Dynamic Ensemble Architecture for Time Series Prediction. (arXiv:2201.05336v1 [cs.LG])
    (0 min) We enhance the accuracy and generalization of univariate time series point prediction by an explainable ensemble on the fly. We propose an Interpretable Dynamic Ensemble Architecture (IDEA), in which interpretable base learners give predictions independently with sparse communication as a group. The model is composed of several sequentially stacked groups connected by group backcast residuals and recurrent input competition. Ensemble driven by end-to-end training both horizontally and vertically brings state-of-the-art (SOTA) performances. Forecast accuracy improves by 2.6% over the best statistical benchmark on the TOURISM dataset and 2% over the best deep learning benchmark on the M4 dataset. The architecture enjoys several advantages, being applicable to time series from various domains, explainable to users with specialized modular structure and robust to changes in task distribution.
    Convolutional Neural Network Compression through Generalized Kronecker Product Decomposition. (arXiv:2109.14710v2 [cs.CV] UPDATED)
    (0 min) Modern Convolutional Neural Network (CNN) architectures, despite their superiority in solving various problems, are generally too large to be deployed on resource constrained edge devices. In this paper, we reduce memory usage and floating-point operations required by convolutional layers in CNNs. We compress these layers by generalizing the Kronecker Product Decomposition to apply to multidimensional tensors, leading to the Generalized Kronecker Product Decomposition (GKPD). Our approach yields a plug-and-play module that can be used as a drop-in replacement for any convolutional layer. Experimental results for image classification on CIFAR-10 and ImageNet datasets using ResNet, MobileNetv2 and SeNet architectures substantiate the effectiveness of our proposed approach. We find that GKPD outperforms state-of-the-art decomposition methods including Tensor-Train and Tensor-Ring as well as other relevant compression methods such as pruning and knowledge distillation.
    Demystifying Swarm Learning: A New Paradigm of Blockchain-based Decentralized Federated Learning. (arXiv:2201.05286v1 [cs.LG])
    (2 min) Federated learning (FL) is an emerging promising privacy-preserving machine learning paradigm and has raised more and more attention from researchers and developers. FL keeps users' private data on devices and exchanges the gradients of local models to cooperatively train a shared Deep Learning (DL) model on central custodians. However, the security and fault tolerance of FL have been increasingly discussed, because its central custodian mechanism or star-shaped architecture can be vulnerable to malicious attacks or software failures. To address these problems, Swarm Learning (SL) introduces a permissioned blockchain to securely onboard members and dynamically elect the leader, which allows performing DL in an extremely decentralized manner. Compared with tremendous attention to SL, there are few empirical studies on SL or blockchain-based decentralized FL, which provide comprehensive knowledge of best practices and precautions of deploying SL in real-world scenarios. Therefore, we conduct the first comprehensive study of SL to date, to fill the knowledge gap between SL deployment and developers, as far as we are concerned. In this paper, we conduct various experiments on 3 public datasets of 5 research questions, present interesting findings, quantitatively analyze the reasons behind these findings, and provide developers and researchers with practical suggestions. The findings have evidenced that SL is supposed to be suitable for most application scenarios, no matter whether the dataset is balanced, polluted, or biased over irrelevant features.
    Collaborative learning of images and geometrics for predicting isocitrate dehydrogenase status of glioma. (arXiv:2201.05530v1 [cs.LG])
    (0 min) The isocitrate dehydrogenase (IDH) gene mutation status is an important biomarker for glioma patients. The gold standard of IDH mutation detection requires tumour tissue obtained via invasive approaches and is usually expensive. Recent advancement in radiogenomics provides a non-invasive approach for predicting IDH mutation based on MRI. Meanwhile, tumor geometrics encompass crucial information for tumour phenotyping. Here we propose a collaborative learning framework that learns both tumor images and tumor geometrics using convolutional neural networks (CNN) and graph neural networks (GNN), respectively. Our results show that the proposed model outperforms the baseline model of 3D-DenseNet121. Further, the collaborative learning model achieves better performance than either the CNN or the GNN alone. The model interpretation shows that the CNN and GNN could identify common and unique regions of interest for IDH mutation prediction. In conclusion, collaborating image and geometric learners provides a novel approach for predicting genotype and characterising glioma.
    The Implicit Regularization of Momentum Gradient Descent with Early Stopping. (arXiv:2201.05405v1 [cs.LG])
    (0 min) The study on the implicit regularization induced by gradient-based optimization is a longstanding pursuit. In the present paper, we characterize the implicit regularization of momentum gradient descent (MGD) with early stopping by comparing with the explicit $\ell_2$-regularization (ridge). In details, we study MGD in the continuous-time view, so-called momentum gradient flow (MGF), and show that its tendency is closer to ridge than the gradient descent (GD) [Ali et al., 2019] for least squares regression. Moreover, we prove that, under the calibration $t=\sqrt{2/\lambda}$, where $t$ is the time parameter in MGF and $\lambda$ is the tuning parameter in ridge regression, the risk of MGF is no more than 1.54 times that of ridge. In particular, the relative Bayes risk of MGF to ridge is between 1 and 1.035 under the optimal tuning. The numerical experiments support our theoretical results strongly.
    Communication-Efficient TeraByte-Scale Model Training Framework for Online Advertising. (arXiv:2201.05500v1 [cs.IR])
    (0 min) Click-Through Rate (CTR) prediction is a crucial component in the online advertising industry. In order to produce a personalized CTR prediction, an industry-level CTR prediction model commonly takes a high-dimensional (e.g., 100 or 1000 billions of features) sparse vector (that is encoded from query keywords, user portraits, etc.) as input. As a result, the model requires Terabyte scale parameters to embed the high-dimensional input. Hierarchical distributed GPU parameter server has been proposed to enable GPU with limited memory to train the massive network by leveraging CPU main memory and SSDs as secondary storage. We identify two major challenges in the existing GPU training framework for massive-scale ad models and propose a collection of optimizations to tackle these challenges: (a) the GPU, CPU, SSD rapidly communicate with each other during the training. The connections between GPUs and CPUs are non-uniform due to the hardware topology. The data communication route should be optimized according to the hardware topology; (b) GPUs in different computing nodes frequently communicates to synchronize parameters. We are required to optimize the communications so that the distributed system can become scalable. In this paper, we propose a hardware-aware training workflow that couples the hardware topology into the algorithm design. To reduce the extensive communication between computing nodes, we introduce a $k$-step model merging algorithm for the popular Adam optimizer and provide its convergence rate in non-convex optimization. To the best of our knowledge, this is the first application of $k$-step adaptive optimization method in industrial-level CTR model training. The numerical results on real-world data confirm that the optimized system design considerably reduces the training time of the massive model, with essentially no loss in accuracy.
    Classification of Brain Tumours in MR Images using Deep Spatiospatial Models. (arXiv:2105.14071v2 [eess.IV] UPDATED)
    (0 min) A brain tumour is a mass or cluster of abnormal cells in the brain, which has the possibility of becoming life-threatening because of its ability to invade neighbouring tissues and also form metastases. An accurate diagnosis is essential for successful treatment planning and magnetic resonance imaging is the principal imaging modality for diagnostic of brain tumours and their extent. Deep Learning methods in computer vision applications have shown significant improvement in recent years, most of which can be credited to the fact that a sizeable amount of data is available to train models on, and the improvements in the model architectures yielding better approximations in a supervised setting. Classifying tumours using such deep learning methods has made significant progress with the availability of open datasets with reliable annotations. Typically those methods are either 3D models, which use 3D volumetric MRIs or even 2D models considering each slice separately. However, by treating the slice spatial dimension separately, spatiotemporal models can be employed as spatiospatial models for this task. These models have the capabilities of learning specific spatial and temporal relationship, while reducing computational costs. This paper uses two spatiotemporal models, ResNet (2+1)D and ResNet Mixed Convolution, to classify different types of brain tumours. It was observed that both these models performed superior to the pure 3D convolutional model, ResNet18. Furthermore, it was also observed that pre-training the models on a different, even unrelated dataset before training them for the task of tumour classification improves the performance. Finally, Pre-trained ResNet Mixed Convolution was observed to be the best model in these experiments, achieving a macro F1-score of 0.93 and a test accuracy of 96.98\%, while at the same time being the model with the least computational cost.
    Consistent Approximations in Composite Optimization. (arXiv:2201.05250v1 [math.OC])
    (0 min) Approximations of optimization problems arise in computational procedures and sensitivity analysis. The resulting effect on solutions can be significant, with even small approximations of components of a problem translating into large errors in the solutions. We specify conditions under which approximations are well behaved in the sense of minimizers, stationary points, and level-sets and this leads to a framework of consistent approximations. The framework is developed for a broad class of composite problems, which are neither convex nor smooth. We demonstrate the framework using examples from stochastic optimization, neural-network based machine learning, distributionally robust optimization, penalty and augmented Lagrangian methods, interior-point methods, homotopy methods, smoothing methods, extended nonlinear programming, difference-of-convex programming, and multi-objective optimization. An enhanced proximal method illustrates the algorithmic possibilities. A quantitative analysis supplements the development by furnishing rates of convergence.
    Decentralized Robot Learning for Personalization and Privacy. (arXiv:2201.05527v1 [cs.LG])
    (2 min) From learning assistance to companionship, social robots promise to enhance many aspects of daily life. However, social robots have not seen widespread adoption, in part because (1) they do not adapt their behavior to new users, and (2) they do not provide sufficient privacy protections. Centralized learning, whereby robots develop skills by gathering data on a server, contributes to these limitations by preventing online learning of new experiences and requiring storage of privacy-sensitive data. In this work, we propose a decentralized learning alternative that improves the privacy and personalization of social robots. We combine two machine learning approaches, Federated Learning and Continual Learning, to capture interaction dynamics distributed physically across robots and temporally across repeated robot encounters. We define a set of criteria that should be balanced in decentralized robot learning scenarios. We also develop a new algorithm -- Elastic Transfer -- that leverages importance-based regularization to preserve relevant parameters across robots and interactions with multiple humans. We show that decentralized learning is a viable alternative to centralized learning in a proof-of-concept Socially-Aware Navigation domain, and demonstrate how Elastic Transfer improves several of the proposed criteria.
    Data Fusion with Latent Map Gaussian Processes. (arXiv:2112.02206v2 [stat.ML] UPDATED)
    (0 min) Multi-fidelity modeling and calibration are data fusion tasks that ubiquitously arise in engineering design. In this paper, we introduce a novel approach based on latent-map Gaussian processes (LMGPs) that enables efficient and accurate data fusion. In our approach, we convert data fusion into a latent space learning problem where the relations among different data sources are automatically learned. This conversion endows our approach with attractive advantages such as increased accuracy, reduced costs, flexibility to jointly fuse any number of data sources, and ability to visualize correlations between data sources. This visualization allows the user to detect model form errors or determine the optimum strategy for high-fidelity emulation by fitting LMGP only to the subset of the data sources that are well-correlated. We also develop a new kernel function that enables LMGPs to not only build a probabilistic multi-fidelity surrogate but also estimate calibration parameters with high accuracy and consistency. The implementation and use of our approach are considerably simpler and less prone to numerical issues compared to existing technologies. We demonstrate the benefits of LMGP-based data fusion by comparing its performance against competing methods on a wide range of examples.
    Use of High Dimensional Modeling for automatic variables selection: the best path algorithm. (arXiv:2105.03173v2 [stat.ML] UPDATED)
    (0 min) This paper presents a new algorithm for automatic variables selection. In particular, using the Graphical Models properties it is possible to develop a method that can be used in the contest of large dataset. The advantage of this algorithm is that can be combined with different forecasting models. In this research we have used the OLS method and we have compared the result with the LASSO method.
    E(n) Equivariant Normalizing Flows. (arXiv:2105.09016v4 [cs.LG] UPDATED)
    (0 min) This paper introduces a generative model equivariant to Euclidean symmetries: E(n) Equivariant Normalizing Flows (E-NFs). To construct E-NFs, we take the discriminative E(n) graph neural networks and integrate them as a differential equation to obtain an invertible equivariant function: a continuous-time normalizing flow. We demonstrate that E-NFs considerably outperform baselines and existing methods from the literature on particle systems such as DW4 and LJ13, and on molecules from QM9 in terms of log-likelihood. To the best of our knowledge, this is the first flow that jointly generates molecule features and positions in 3D.
    Probabilistic Mass Mapping with Neural Score Estimation. (arXiv:2201.05561v1 [astro-ph.CO])
    (0 min) Weak lensing mass-mapping is a useful tool to access the full distribution of dark matter on the sky, but because of intrinsic galaxy ellipticies and finite fields/missing data, the recovery of dark matter maps constitutes a challenging ill-posed inverse problem. We introduce a novel methodology allowing for efficient sampling of the high-dimensional Bayesian posterior of the weak lensing mass-mapping problem, and relying on simulations for defining a fully non-Gaussian prior. We aim to demonstrate the accuracy of the method on simulations, and then proceed to applying it to the mass reconstruction of the HST/ACS COSMOS field. The proposed methodology combines elements of Bayesian statistics, analytic theory, and a recent class of Deep Generative Models based on Neural Score Matching. This approach allows us to do the following: 1) Make full use of analytic cosmological theory to constrain the 2pt statistics of the solution. 2) Learn from cosmological simulations any differences between this analytic prior and full simulations. 3) Obtain samples from the full Bayesian posterior of the problem for robust Uncertainty Quantification. We demonstrate the method on the $\kappa$TNG simulations and find that the posterior mean significantly outperfoms previous methods (Kaiser-Squires, Wiener filter, Sparsity priors) both on root-mean-square error and in terms of the Pearson correlation. We further illustrate the interpretability of the recovered posterior by establishing a close correlation between posterior convergence values and SNR of clusters artificially introduced into a field. Finally, we apply the method to the reconstruction of the HST/ACS COSMOS field and yield the highest quality convergence map of this field to date.
    Tactical Optimism and Pessimism for Deep Reinforcement Learning. (arXiv:2102.03765v4 [cs.LG] UPDATED)
    (0 min) In recent years, deep off-policy actor-critic algorithms have become a dominant approach to reinforcement learning for continuous control. One of the primary drivers of this improved performance is the use of pessimistic value updates to address function approximation errors, which previously led to disappointing performance. However, a direct consequence of pessimism is reduced exploration, running counter to theoretical support for the efficacy of optimism in the face of uncertainty. So which approach is best? In this work, we show that the most effective degree of optimism can vary both across tasks and over the course of learning. Inspired by this insight, we introduce a novel deep actor-critic framework, Tactical Optimistic and Pessimistic (TOP) estimation, which switches between optimistic and pessimistic value learning online. This is achieved by formulating the selection as a multi-arm bandit problem. We show in a series of continuous control tasks that TOP outperforms existing methods which rely on a fixed degree of optimism, setting a new state of the art in challenging pixel-based environments. Since our changes are simple to implement, we believe these insights can easily be incorporated into a multitude of off-policy algorithms.
    Multimodal analysis of the predictability of hand-gesture properties. (arXiv:2108.05762v3 [cs.HC] UPDATED)
    (0 min) Embodied conversational agents benefit from being able to accompany their speech with gestures. Although many data-driven approaches to gesture generation have been proposed in recent years, it is still unclear whether such systems can consistently generate gestures that convey meaning. We investigate which gesture properties (phase, category, and semantics) can be predicted from speech text and/or audio using contemporary deep learning. In extensive experiments, we show that gesture properties related to gesture meaning (semantics and category) are predictable from text features (time-aligned FastText embeddings) alone, but not from prosodic audio features, while rhythm-related gesture properties (phase) on the other hand can be predicted from audio features better than from text. These results are encouraging as they indicate that it is possible to equip an embodied agent with content-wise meaningful co-speech gestures using a machine-learning model.
    StAnD: A Dataset of Linear Static Analysis Problems. (arXiv:2201.05356v1 [cs.LG])
    (0 min) Static analysis of structures is a fundamental step for determining the stability of structures. Both linear and non-linear static analyses consist of the resolution of sparse linear systems obtained by the finite element method. The development of fast and optimized solvers for sparse linear systems appearing in structural engineering requires data to compare existing approaches, tune algorithms or to evaluate new ideas. We introduce the Static Analysis Dataset (StAnD) containing 303.000 static analysis problems obtained applying realistic loads to simulated frame structures. Along with the dataset, we publish a detailed benchmark comparison of the running time of existing solvers both on CPU and GPU. We release the code used to generate the dataset and benchmark existing solvers on Github. To the best of our knowledge, this is the largest dataset for static analysis problems and it is the first public dataset of sparse linear systems (containing both the matrix and a realistic constant term).
    Towards a Data Privacy-Predictive Performance Trade-off. (arXiv:2201.05226v1 [cs.LG])
    (2 min) Machine learning is increasingly used in the most diverse applications and domains, whether in healthcare, to predict pathologies, or in the financial sector to detect fraud. One of the linchpins for efficiency and accuracy in machine learning is data utility. However, when it contains personal information, full access may be restricted due to laws and regulations aiming to protect individuals' privacy. Therefore, data owners must ensure that any data shared guarantees such privacy. Removal or transformation of private information (de-identification) are among the most common techniques. Intuitively, one can anticipate that reducing detail or distorting information would result in losses for model predictive performance. However, previous work concerning classification tasks using de-identified data generally demonstrates that predictive performance can be preserved in specific applications. In this paper, we aim to evaluate the existence of a trade-off between data privacy and predictive performance in classification tasks. We leverage a large set of privacy-preserving techniques and learning algorithms to provide an assessment of re-identification ability and the impact of transformed variants on predictive performance. Unlike previous literature, we confirm that the higher the level of privacy (lower re-identification risk), the higher the impact on predictive performance, pointing towards clear evidence of a trade-off.
    Robust Explainability: A Tutorial on Gradient-Based Attribution Methods for Deep Neural Networks. (arXiv:2107.11400v4 [cs.LG] UPDATED)
    (0 min) With the rise of deep neural networks, the challenge of explaining the predictions of these networks has become increasingly recognized. While many methods for explaining the decisions of deep neural networks exist, there is currently no consensus on how to evaluate them. On the other hand, robustness is a popular topic for deep learning research; however, it is hardly talked about in explainability until very recently. In this tutorial paper, we start by presenting gradient-based interpretability methods. These techniques use gradient signals to assign the burden of the decision on the input features. Later, we discuss how gradient-based methods can be evaluated for their robustness and the role that adversarial robustness plays in having meaningful explanations. We also discuss the limitations of gradient-based methods. Finally, we present the best practices and attributes that should be examined before choosing an explainability method. We conclude with the future directions for research in the area at the convergence of robustness and explainability.
    Generalization Error Bounds for Iterative Recovery Algorithms Unfolded as Neural Networks. (arXiv:2112.04364v2 [cs.LG] UPDATED)
    (2 min) Motivated by the learned iterative soft thresholding algorithm (LISTA), we introduce a general class of neural networks suitable for sparse reconstruction from few linear measurements. By allowing a wide range of degrees of weight-sharing between the layers, we enable a unified analysis for very different neural network types, ranging from recurrent ones to networks more similar to standard feedforward neural networks. Based on training samples, via empirical risk minimization we aim at learning the optimal network parameters and thereby the optimal network that reconstructs signals from their low-dimensional linear measurements. We derive generalization bounds by analyzing the Rademacher complexity of hypothesis classes consisting of such deep networks, that also take into account the thresholding parameters. We obtain estimates of the sample complexity that essentially depend only linearly on the number of parameters and on the depth. We apply our main result to obtain specific generalization bounds for several practical examples, including different algorithms for (implicit) dictionary learning, and convolutional neural networks.
    A Method for Controlling Extrapolation when Visualizing and Optimizing the Prediction Profiles of Statistical and Machine Learning Models. (arXiv:2201.05236v1 [stat.ML])
    (0 min) We present a novel method for controlling extrapolation in the prediction profiler in the JMP software. The prediction profiler is a graphical tool for exploring high dimensional prediction surfaces for statistical and machine learning models. The profiler contains interactive cross-sectional views, or profile traces, of the prediction surface of a model. Our method helps users avoid exploring predictions that should be considered extrapolation. It also performs optimization over a constrained factor region that avoids extrapolation using a genetic algorithm. In simulations and real world examples, we demonstrate how optimal factor settings without constraint in the profiler are frequently extrapolated, and how extrapolation control helps avoid these solutions with invalid factor settings that may not be useful to the user.
    A New Deep Hybrid Boosted and Ensemble Learning-based Brain Tumor Analysis using MRI. (arXiv:2201.05373v1 [eess.IV])
    (0 min) Brain tumors analysis is important in timely diagnosis and effective treatment to cure patients. Tumor analysis is challenging because of tumor morphology like size, location, texture, and heteromorphic appearance in the medical images. In this regard, a novel two-phase deep learning-based framework is proposed to detect and categorize brain tumors in magnetic resonance images (MRIs). In the first phase, a novel deep boosted features and ensemble classifiers (DBF-EC) scheme is proposed to detect tumor MRI images from healthy individuals effectively. The deep boosted feature space is achieved through the customized and well-performing deep convolutional neural networks (CNNs), and consequently, fed into the ensemble of machine learning (ML) classifiers. While in the second phase, a new hybrid features fusion-based brain tumor classification approach is proposed, comprised of dynamic-static feature and ML classifier to categorize different tumor types. The dynamic features are extracted from the proposed BRAIN-RENet CNN, which carefully learns heteromorphic and inconsistent behavior of various tumors, while the static features are extracted using HOG. The effectiveness of the proposed two-phase brain tumor analysis framework is validated on two standard benchmark datasets; collected from Kaggle and Figshare containing different types of tumor, including glioma, meningioma, pituitary, and normal images. Experimental results proved that the proposed DBF-EC detection scheme outperforms and achieved accuracy (99.56%), precision (0.9991), recall (0.9899), F1-Score (0.9945), MCC (0.9892), and AUC-PR (0.9990). While the classification scheme, the joint employment of the deep features fusion of proposed BRAIN-RENet and HOG features improves performance significantly in terms of recall (0.9913), precision (0.9906), F1-Score (0.9909), and accuracy (99.20%) on diverse datasets.
    Domain-shift adaptation via linear transformations. (arXiv:2201.05282v1 [cs.LG])
    (0 min) A predictor, $f_A : X \to Y$, learned with data from a source domain (A) might not be accurate on a target domain (B) when their distributions are different. Domain adaptation aims to reduce the negative effects of this distribution mismatch. Here, we analyze the case where $P_A(Y\ |\ X) \neq P_B(Y\ |\ X)$, $P_A(X) \neq P_B(X)$ but $P_A(Y) = P_B(Y)$; where there are affine transformations of $X$ that makes all distributions equivalent. We propose an approach to project the source and target domains into a lower-dimensional, common space, by (1) projecting the domains into the eigenvectors of the empirical covariance matrices of each domain, then (2) finding an orthogonal matrix that minimizes the maximum mean discrepancy between the projections of both domains. For arbitrary affine transformations, there is an inherent unidentifiability problem when performing unsupervised domain adaptation that can be alleviated in the semi-supervised case. We show the effectiveness of our approach in simulated data and in binary digit classification tasks, obtaining improvements up to 48% accuracy when correcting for the domain shift in the data.
    Machine Learning for Multi-Output Regression: When should a holistic multivariate approach be preferred over separate univariate ones?. (arXiv:2201.05340v1 [stat.ML])
    (0 min) Tree-based ensembles such as the Random Forest are modern classics among statistical learning methods. In particular, they are used for predicting univariate responses. In case of multiple outputs the question arises whether we separately fit univariate models or directly follow a multivariate approach. For the latter, several possibilities exist that are, e.g. based on modified splitting or stopping rules for multi-output regression. In this work we compare these methods in extensive simulations to help in answering the primary question when to use multivariate ensemble techniques.
    On Preference Learning Based on Sequential Bayesian Optimization with Pairwise Comparison. (arXiv:2103.13192v2 [cs.LG] UPDATED)
    (0 min) User preference learning is generally a hard problem. Individual preferences are typically unknown even to users themselves, while the space of choices is infinite. Here we study user preference learning from information-theoretic perspective. We model preference learning as a system with two interacting sub-systems, one representing a user with his/her preferences and another one representing an agent that has to learn these preferences. The user with his/her behaviour is modeled by a parametric preference function. To efficiently learn the preferences and reduce search space quickly, we propose the agent that interacts with the user to collect the most informative data for learning. The agent presents two proposals to the user for evaluation, and the user rates them based on his/her preference function. We show that the optimum agent strategy for data collection and preference learning is a result of maximin optimization of the normalized weighted Kullback-Leibler (KL) divergence between true and agent-assigned predictive user response distributions. The resulting value of KL-divergence, which we also call remaining system uncertainty (RSU), provides an efficient performance metric in the absence of the ground truth. This metric characterises how well the agent can predict user and, thus, the quality of the underlying learned user (preference) model. Our proposed agent comprises sequential mechanisms for user model inference and proposal generation. To infer the user model (preference function), Bayesian approximate inference is used in the agent. The data collection strategy is to generate proposals, responses to which help resolving uncertainty associated with prediction of the user responses the most. The efficiency of our approach is validated by numerical simulations.
    Replay-Guided Adversarial Environment Design. (arXiv:2110.02439v2 [cs.LG] UPDATED)
    (0 min) Deep reinforcement learning (RL) agents may successfully generalize to new settings if trained on an appropriately diverse set of environment and task configurations. Unsupervised Environment Design (UED) is a promising self-supervised RL paradigm, wherein the free parameters of an underspecified environment are automatically adapted during training to the agent's capabilities, leading to the emergence of diverse training environments. Here, we cast Prioritized Level Replay (PLR), an empirically successful but theoretically unmotivated method that selectively samples randomly-generated training levels, as UED. We argue that by curating completely random levels, PLR, too, can generate novel and complex levels for effective training. This insight reveals a natural class of UED methods we call Dual Curriculum Design (DCD). Crucially, DCD includes both PLR and a popular UED algorithm, PAIRED, as special cases and inherits similar theoretical guarantees. This connection allows us to develop novel theory for PLR, providing a version with a robustness guarantee at Nash equilibria. Furthermore, our theory suggests a highly counterintuitive improvement to PLR: by stopping the agent from updating its policy on uncurated levels (training on less data), we can improve the convergence to Nash equilibria. Indeed, our experiments confirm that our new method, PLR$^{\perp}$, obtains better results on a suite of out-of-distribution, zero-shot transfer tasks, in addition to demonstrating that PLR$^{\perp}$ improves the performance of PAIRED, from which it inherited its theoretical framework.
    Learning Enhancement of CNNs via Separation Index Maximizing at the First Convolutional Layer. (arXiv:2201.05217v1 [cs.LG])
    (0 min) In this paper, a straightforward enhancement learning algorithm based on Separation Index (SI) concept is proposed for Convolutional Neural Networks (CNNs). At first, the SI as a supervised complexity measure is explained its usage in better learning of CNNs for classification problems illustrate. Then, a learning strategy proposes through which the first layer of a CNN is optimized by maximizing the SI, and the further layers are trained through the backpropagation algorithm to learn further layers. In order to maximize the SI at the first layer, A variant of ranking loss is optimized by using the quasi least square error technique. Applying such a learning strategy to some known CNNs and datasets, its enhancement impact in almost all cases is demonstrated.
    RecoMed: A Knowledge-Aware Recommender System for Hypertension Medications. (arXiv:2201.05461v1 [cs.IR])
    (0 min) Background and Objective High medicine diversity has always been a significant challenge for prescription, causing confusion or doubt in physicians' decision-making process. This paper aims to develop a medicine recommender system called RecoMed to aid the physician in the prescription process of hypertension by providing information about what medications have been prescribed by other doctors and figuring out what other medicines can be recommended in addition to the one in question. Methods There are two steps to the developed method: First, association rule mining algorithms are employed to find medicine association rules. The second step entails graph mining and clustering to present an enriched recommendation via ATC code, which itself comprises several steps. First, the initial graph is constructed from historical prescription data. Then, data pruning is performed in the second step, after which the medicines with a high repetition rate are removed at the discretion of a general medical practitioner. Next, the medicines are matched to a well-known medicine classification system called the ATC code to provide an enriched recommendation. And finally, the DBSCAN and Louvain algorithms cluster medicines in the final step. Results A list of recommended medicines is provided as the system's output, and physicians can choose one or more of the medicines based on the patient's clinical symptoms. Only the medicines of class 2, related to high blood pressure medications, are used to assess the system's performance. The results obtained from this system have been reviewed and confirmed by an expert in this field.
    Compact Graph Structure Learning via Mutual Information Compression. (arXiv:2201.05540v1 [cs.LG])
    (0 min) Graph Structure Learning (GSL) recently has attracted considerable attentions in its capacity of optimizing graph structure as well as learning suitable parameters of Graph Neural Networks (GNNs) simultaneously. Current GSL methods mainly learn an optimal graph structure (final view) from single or multiple information sources (basic views), however the theoretical guidance on what is the optimal graph structure is still unexplored. In essence, an optimal graph structure should only contain the information about tasks while compress redundant noise as much as possible, which is defined as "minimal sufficient structure", so as to maintain the accurancy and robustness. How to obtain such structure in a principled way? In this paper, we theoretically prove that if we optimize basic views and final view based on mutual information, and keep their performance on labels simultaneously, the final view will be a minimal sufficient structure. With this guidance, we propose a Compact GSL architecture by MI compression, named CoGSL. Specifically, two basic views are extracted from original graph as two inputs of the model, which are refinedly reestimated by a view estimator. Then, we propose an adaptive technique to fuse estimated views into the final view. Furthermore, we maintain the performance of estimated views and the final view and reduce the mutual information of every two views. To comprehensively evaluate the performance of CoGSL, we conduct extensive experiments on several datasets under clean and attacked conditions, which demonstrate the effectiveness and robustness of CoGSL.
    AWSnet: An Auto-weighted Supervision Attention Network for Myocardial Scar and Edema Segmentation in Multi-sequence Cardiac Magnetic Resonance Images. (arXiv:2201.05344v1 [eess.IV])
    (0 min) Multi-sequence cardiac magnetic resonance (CMR) provides essential pathology information (scar and edema) to diagnose myocardial infarction. However, automatic pathology segmentation can be challenging due to the difficulty of effectively exploring the underlying information from the multi-sequence CMR data. This paper aims to tackle the scar and edema segmentation from multi-sequence CMR with a novel auto-weighted supervision framework, where the interactions among different supervised layers are explored under a task-specific objective using reinforcement learning. Furthermore, we design a coarse-to-fine framework to boost the small myocardial pathology region segmentation with shape prior knowledge. The coarse segmentation model identifies the left ventricle myocardial structure as a shape prior, while the fine segmentation model integrates a pixel-wise attention strategy with an auto-weighted supervision model to learn and extract salient pathological structures from the multi-sequence CMR data. Extensive experimental results on a publicly available dataset from Myocardial pathology segmentation combining multi-sequence CMR (MyoPS 2020) demonstrate our method can achieve promising performance compared with other state-of-the-art methods. Our method is promising in advancing the myocardial pathology assessment on multi-sequence CMR data. To motivate the community, we have made our code publicly available via https://github.com/soleilssss/AWSnet/tree/master.
    Learning from Mistakes -- A Framework for Neural Architecture Search. (arXiv:2111.06353v2 [cs.LG] UPDATED)
    (0 min) Learning from one's mistakes is an effective human learning technique where the learners focus more on the topics where mistakes were made, so as to deepen their understanding. In this paper, we investigate if this human learning strategy can be applied in machine learning. We propose a novel machine learning method called Learning From Mistakes (LFM), wherein the learner improves its ability to learn by focusing more on the mistakes during revision. We formulate LFM as a three-stage optimization problem: 1) learner learns; 2) learner re-learns focusing on the mistakes, and; 3) learner validates its learning. We develop an efficient algorithm to solve the LFM problem. We apply the LFM framework to neural architecture search on CIFAR-10, CIFAR-100, and Imagenet. Experimental results strongly demonstrate the effectiveness of our model.
    Multi-modal Attention Network for Stock Movements Prediction. (arXiv:2112.13593v2 [cs.LG] UPDATED)
    (0 min) Stock prices move as piece-wise trending fluctuation rather than a purely random walk. Traditionally, the prediction of future stock movements is based on the historical trading record. Nowadays, with the development of social media, many active participants in the market choose to publicize their strategies, which provides a window to glimpse over the whole market's attitude towards future movements by extracting the semantics behind social media. However, social media contains conflicting information and cannot replace historical records completely. In this work, we propose a multi-modality attention network to reduce conflicts and integrate semantic and numeric features to predict future stock movements comprehensively. Specifically, we first extract semantic information from social media and estimate their credibility based on posters' identity and public reputation. Then we incorporate the semantic from online posts and numeric features from historical records to make the trading strategy. Experimental results show that our approach outperforms previous methods by a significant margin in both prediction accuracy (61.20\%) and trading profits (9.13\%). It demonstrates that our method improves the performance of stock movements prediction and informs future research on multi-modality fusion towards stock prediction.
    Assemble Foundation Models for Automatic Code Summarization. (arXiv:2201.05222v1 [cs.SE])
    (0 min) Automatic code summarization is beneficial to software development and maintenance since it reduces the burden of manual tasks. Currently, artificial intelligence is undergoing a paradigm shift. The foundation models pretrained on massive data and finetuned to downstream tasks surpass specially customized models. This trend inspired us to consider reusing foundation models instead of learning from scratch. Based on this, we propose a flexible and robust approach for automatic code summarization based on neural networks. We assemble available foundation models, such as CodeBERT and GPT-2, into a single model named AdaMo. Moreover, we utilize Gaussian noise as the simulation of contextual information to optimize the latent representation. Furthermore, we introduce two adaptive schemes from the perspective of knowledge transfer, namely continuous pretraining and intermediate finetuning, and design intermediate stage tasks for general sequence-to-sequence learning. Finally, we evaluate AdaMo against a benchmark dataset for code summarization, by comparing it with state-of-the-art models.
    Security Orchestration, Automation, and Response Engine for Deployment of Behavioural Honeypots. (arXiv:2201.05326v1 [cs.CR])
    (0 min) Cyber Security is a critical topic for organizations with IT/OT networks as they are always susceptible to attack, whether insider or outsider. Since the cyber landscape is an ever-evolving scenario, one must keep upgrading its security systems to enhance the security of the infrastructure. Tools like Security Information and Event Management (SIEM), Endpoint Detection and Response (EDR), Threat Intelligence Platform (TIP), Information Technology Service Management (ITSM), along with other defensive techniques like Intrusion Detection System (IDS), Intrusion Protection System (IPS), and many others enhance the cyber security posture of the infrastructure. However, the proposed protection mechanisms have their limitations, they are insufficient to ensure security, and the attacker penetrates the network. Deception technology, along with Honeypots, provides a false sense of vulnerability in the target systems to the attackers. The attacker deceived reveals threat intel about their modus operandi. We have developed a Security Orchestration, Automation, and Response (SOAR) Engine that dynamically deploys custom honeypots inside the internal network infrastructure based on the attacker's behavior. The architecture is robust enough to support multiple VLANs connected to the system and used for orchestration. The presence of botnet traffic and DDOS attacks on the honeypots in the network is detected, along with a malware collection system. After being exposed to live traffic for four days, our engine dynamically orchestrated the honeypots 40 times, detected 7823 attacks, 965 DDOS attack packets, and three malicious samples. While our experiments with static honeypots show an average attacker engagement time of 102 seconds per instance, our SOAR Engine-based dynamic honeypots engage attackers on average 3148 seconds.
    Impact of Stop Sets on Stopping Active Learning for Text Classification. (arXiv:2201.05460v1 [cs.IR])
    (0 min) Active learning is an increasingly important branch of machine learning and a powerful technique for natural language processing. The main advantage of active learning is its potential to reduce the amount of labeled data needed to learn high-performing models. A vital aspect of an effective active learning algorithm is the determination of when to stop obtaining additional labeled data. Several leading state-of-the-art stopping methods use a stop set to help make this decision. However, there has been relatively less attention given to the choice of stop set than to the stopping algorithms that are applied on the stop set. Different choices of stop sets can lead to significant differences in stopping method performance. We investigate the impact of different stop set choices on different stopping methods. This paper shows the choice of the stop set can have a significant impact on the performance of stopping methods and the impact is different for stability-based methods from that on confidence-based methods. Furthermore, the unbiased representative stop sets suggested by original authors of methods work better than the systematically biased stop sets used in recently published work, and stopping methods based on stabilizing predictions have stronger performance than confidence-based stopping methods when unbiased representative stop sets are used. We provide the largest quantity of experimental results on the impact of stop sets to date. The findings are important for helping to illuminate the impact of this important aspect of stopping methods that has been under-considered in recently published work and that can have a large practical impact on the performance of stopping methods for important semantic computing applications such as technology assisted review and text classification more broadly.
    Networks with pixels embedding: a method to improve noise resistance in images classification. (arXiv:2005.11679v3 [cs.CV] UPDATED)
    (0 min) In the task of image classification, usually, the network is sensitive to noises. For example, an image of cat with noises might be misclassified as an ostrich. Conventionally, to overcome the problem of noises, one uses the technique of data augmentation, that is, to teach the network to distinguish noises by adding more images with noises in the training dataset. In this work, we provide a noise-resistance network in images classification by introducing a technique of pixel embedding. We test the network with pixel embedding, which is abbreviated as the network with PE, on the mnist database of handwritten digits. It shows that the network with PE outperforms the conventional network on images with noises. The technique of pixel embedding can be used in many tasks of image classification to improve noise resistance.
    Radar Occupancy Prediction with Lidar Supervision while Preserving Long-Range Sensing and Penetrating Capabilities. (arXiv:2112.04282v2 [cs.RO] UPDATED)
    (0 min) Radar shows great potential for autonomous driving by accomplishing long-range sensing under diverse weather conditions. But radar is also a particularly challenging sensing modality due to the radar noises. Recent works have made enormous progress in classifying free and occupied spaces in radar images by leveraging lidar label supervision. However, there are still several unsolved issues. Firstly, the sensing distance of the results is limited by the sensing range of lidar. Secondly, the performance of the results is degenerated by lidar due to the physical sensing discrepancies between the two sensors. For example, some objects visible to lidar are invisible to radar, and some objects occluded in lidar scans are visible in radar images because of the radar's penetrating capability. These sensing differences cause false positive and penetrating capability degeneration, respectively. In this paper, we propose training data preprocessing and polar sliding window inference to solve the issues. The data preprocessing aims to reduce the effect caused by radar-invisible measurements in lidar scans. The polar sliding window inference aims to solve the limited sensing range issue by applying a near-range trained network to the long-range region. Instead of using common Cartesian representation, we propose to use polar representation to reduce the shape dissimilarity between long-range and near-range data. We find that extending a near-range trained network to long-range region inference in the polar space has 4.2 times better IoU than in Cartesian space. Besides, the polar sliding window inference can preserve the radar penetrating capability by changing the viewpoint of the inference region, which makes some occluded measurements seem non-occluded for a pretrained network.
    Fast and accurate waveform modeling of long-haul multi-channel optical fiber transmission using a hybrid model-data driven scheme. (arXiv:2201.05502v1 [eess.SP])
    (0 min) The modeling of optical wave propagation in optical fiber is a task of fast and accurate solving the nonlinear Schr\"odinger equation (NLSE), and can enable the research progress and system design of optical fiber communications, which are the infrastructure of modern communication systems. Traditional modeling of fiber channels using the split-step Fourier method (SSFM) has long been regarded as challenging in long-haul wavelength division multiplexing (WDM) optical fiber communication systems because it is extremely time-consuming. Here we propose a linear-nonlinear feature decoupling distributed (FDD) waveform modeling scheme to model long-haul WDM fiber channel, where the channel linear effects are modelled by the NLSE-derived model-driven methods and the nonlinear effects are modelled by the data-driven deep learning methods. Meanwhile, the proposed scheme only focuses on one-span fiber distance fitting, and then recursively transmits the model to achieve the required transmission distance. The proposed modeling scheme is demonstrated to have high accuracy, high computing speeds, and robust generalization abilities for different optical launch powers, modulation formats, channel numbers and transmission distances. The total running time of FDD waveform modeling scheme for 41-channel 1040-km fiber transmission is only 3 minutes versus more than 2 hours using SSFM for each input condition, which achieves a 98% reduction in computing time. Considering the multi-round optimization by adjusting system parameters, the complexity reduction is significant. The results represent a remarkable improvement in nonlinear fiber modeling and open up novel perspectives for solution of NLSE-like partial differential equations and optical fiber physics problems.
    Self-Supervised Learning with Data Augmentations Provably Isolates Content from Style. (arXiv:2106.04619v4 [stat.ML] UPDATED)
    (0 min) Self-supervised representation learning has shown remarkable success in a number of domains. A common practice is to perform data augmentation via hand-crafted transformations intended to leave the semantics of the data invariant. We seek to understand the empirical success of this approach from a theoretical perspective. We formulate the augmentation process as a latent variable model by postulating a partition of the latent representation into a content component, which is assumed invariant to augmentation, and a style component, which is allowed to change. Unlike prior work on disentanglement and independent component analysis, we allow for both nontrivial statistical and causal dependencies in the latent space. We study the identifiability of the latent representation based on pairs of views of the observations and prove sufficient conditions that allow us to identify the invariant content partition up to an invertible mapping in both generative and discriminative settings. We find numerical simulations with dependent latent variables are consistent with our theory. Lastly, we introduce Causal3DIdent, a dataset of high-dimensional, visually complex images with rich causal dependencies, which we use to study the effect of data augmentations performed in practice.
    Reusing Auto-Schedules for Efficient DNN Compilation. (arXiv:2201.05587v1 [cs.LG])
    (0 min) Auto-scheduling is a process where a search algorithm automatically explores candidate schedules (program transformations) for a given tensor program on a given hardware platform to improve its performance. However this can be a very time consuming process, depending on the complexity of the tensor program, and capacity of the target device, with often many thousands of program variants being explored. To address this, in this paper we introduce and demonstrate the idea of \emph{tuning-reuse}, a novel approach to identify and re-use auto-schedules between tensor programs. We demonstrate this concept using Deep Neural Networks (DNNs), taking sets of auto-schedules from pre-tuned DNNs, and using them to reduce the inference time of a new DNN. Given a set of pre-tuned schedules, tuning-reuse provides its maximum speedup in less time than auto-scheduling using the state-of-the-art Ansor auto-scheduler. On a set of widely used DNN models, we apply tuning-reuse and achieve maximum speedups between $1.16\times$ and $4.76\times$, while outperforming Ansor when given limited tuning time.
    Study of Frequency domain exponential functional link network filters. (arXiv:2201.05501v1 [eess.SP])
    (0 min) The exponential functional link network (EFLN) filter has attracted tremendous interest due to its enhanced nonlinear modeling capability. However, the computational complexity will dramatically increase with the dimension growth of the EFLN-based filter. To improve the computational efficiency, we propose a novel frequency domain exponential functional link network (FDEFLN) filter in this paper. The idea is to organize the samples in blocks of expanded input data, transform them from time domain to frequency domain, and thus execute the filtering and adaptation procedures in frequency domain with the overlap-save method. A FDEFLN-based nonlinear active noise control (NANC) system has also been developed to form the frequency domain exponential filtered-s least mean-square (FDEFsLMS) algorithm. Moreover, the stability, steady-state performance and computational complexity of algorithms are analyzed. Finally, several numerical experiments corroborate the proposed FDEFLN-based algorithms in nonlinear system identification, acoustic echo cancellation and NANC implementations, which demonstrate much better computational efficiency.
    Waveform Learning for Reduced Out-of-Band Emissions Under a Nonlinear Power Amplifier. (arXiv:2201.05524v1 [eess.SP])
    (0 min) Machine learning (ML) has shown great promise in optimizing various aspects of the physical layer processing in wireless communication systems. In this paper, we use ML to learn jointly the transmit waveform and the frequency-domain receiver. In particular, we consider a scenario where the transmitter power amplifier is operating in a nonlinear manner, and ML is used to optimize the waveform to minimize the out-of-band emissions. The system also learns a constellation shape that facilitates pilotless detection by the simultaneously learned receiver. The simulation results show that such an end-to-end optimized system can communicate data more accurately and with less out-of-band emissions than conventional systems, thereby demonstrating the potential of ML in optimizing the air interface. To the best of our knowledge, there are no prior works considering the power amplifier induced emissions in an end-to-end learned system. These findings pave the way towards an ML-native air interface, which could be one of the building blocks of 6G.
    DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale. (arXiv:2201.05596v1 [cs.LG])
    (0 min) As the training of giant dense models hits the boundary on the availability and capability of the hardware resources today, Mixture-of-Experts (MoE) models become one of the most promising model architectures due to their significant training cost reduction compared to a quality-equivalent dense model. Its training cost saving is demonstrated from encoder-decoder models (prior works) to a 5x saving for auto-aggressive language models (this work along with parallel explorations). However, due to the much larger model size and unique architecture, how to provide fast MoE model inference remains challenging and unsolved, limiting its practical usage. To tackle this, we present DeepSpeed-MoE, an end-to-end MoE training and inference solution as part of the DeepSpeed library, including novel MoE architecture designs and model compression techniques that reduce MoE model size by up to 3.7x, and a highly optimized inference system that provides 7.3x better latency and cost compared to existing MoE inference solutions. DeepSpeed-MoE offers an unprecedented scale and efficiency to serve massive MoE models with up to 4.5x faster and 9x cheaper inference compared to quality-equivalent dense models. We hope our innovations and systems help open a promising path to new directions in the large model landscape, a shift from dense to sparse MoE models, where training and deploying higher-quality models with fewer resources becomes more widely possible.
    Generalized Approach to Matched Filtering using Neural Networks. (arXiv:2104.03961v2 [astro-ph.IM] UPDATED)
    (0 min) Gravitational wave science is a pioneering field with rapidly evolving data analysis methodology currently assimilating and inventing deep learning techniques. The bulk of the sophisticated flagship searches of the field rely on the time-tested matched filtering principle within their core. In this paper, we make a key observation on the relationship between the emerging deep learning and the traditional techniques: matched filtering is formally equivalent to a particular neural network. This means that a neural network can be constructed analytically to exactly implement matched filtering, and can be further trained on data or boosted with additional complexity for improved performance. Moreover, we show that the proposed neural network architecture can outperform matched filtering, both with or without knowledge of a prior on the parameter distribution. When a prior is given, the proposed neural network can approach the statistically optimal performance. We also propose and investigate two different neural network architectures MNet-Shallow and MNet-Deep, both of which implement matched filtering at initialization and can be trained on data. MNet-Shallow has simpler structure, while MNet-Deep is more flexible and can deal with a wider range of distributions. Our theoretical findings are corroborated by experiments using real LIGO data and synthetic injections, where our proposed methods significantly outperform matched filtering at false positive rates above $5\times 10^{-3}\%$. The fundamental equivalence between matched filtering and neural networks allows us to define a "complexity standard candle" to characterize the relative complexity of the different approaches to gravitational wave signal searches in a common framework. Finally, our results suggest new perspectives on the role of deep learning in gravitational wave detection.
    Reinforcement Learning to Solve NP-hard Problems: an Application to the CVRP. (arXiv:2201.05393v1 [cs.AI])
    (0 min) In this paper, we evaluate the use of Reinforcement Learning (RL) to solve a classic combinatorial optimization problem: the Capacitated Vehicle Routing Problem (CVRP). We formalize this problem in the RL framework and compare two of the most promising RL approaches with traditional solving techniques on a set of benchmark instances. We measure the different approaches with the quality of the solution returned and the time required to return it. We found that despite not returning the best solution, the RL approach has many advantages over traditional solvers. First, the versatility of the framework allows the resolution of more complex combinatorial problems. Moreover, instead of trying to solve a specific instance of the problem, the RL algorithm learns the skills required to solve the problem. The trained policy can then quasi instantly provide a solution to an unseen problem without having to solve it from scratch. Finally, the use of trained models makes the RL solver by far the fastest, and therefore make this approach more suited for commercial use where the user experience is paramount. Techniques like Knowledge Transfer can also be used to improve the training efficiency of the algorithm and help solve bigger and more complex problems.
    Path Regularization: A Convexity and Sparsity Inducing Regularization for Parallel ReLU Networks. (arXiv:2110.09548v3 [cs.LG] UPDATED)
    (0 min) Understanding the fundamental principles behind the success of deep neural networks is one of the most important open questions in the current literature. To this end, we study the training problem of deep neural networks and introduce an analytic approach to unveil hidden convexity in the optimization landscape. We consider a deep parallel ReLU network architecture, which also includes standard deep networks and ResNets as its special cases. We then show that pathwise regularized training problems can be represented as an exact convex optimization problem. We further prove that the equivalent convex problem is regularized via a group sparsity inducing norm. Thus, a path regularized parallel ReLU network can be viewed as a parsimonious convex model in high dimensions. More importantly, we show that the computational complexity required to globally optimize the equivalent convex problem is fully polynomial-time in feature dimension and number of samples. Therefore, we prove polynomial-time trainability of path regularized ReLU networks with global optimality guarantees. We also provide several numerical experiments corroborating our theory.
    Reinforcement Learning in Time-Varying Systems: an Empirical Study. (arXiv:2201.05560v1 [cs.LG])
    (0 min) Recent research has turned to Reinforcement Learning (RL) to solve challenging decision problems, as an alternative to hand-tuned heuristics. RL can learn good policies without the need for modeling the environment's dynamics. Despite this promise, RL remains an impractical solution for many real-world systems problems. A particularly challenging case occurs when the environment changes over time, i.e. it exhibits non-stationarity. In this work, we characterize the challenges introduced by non-stationarity and develop a framework for addressing them to train RL agents in live systems. Such agents must explore and learn new environments, without hurting the system's performance, and remember them over time. To this end, our framework (1) identifies different environments encountered by the live system, (2) explores and trains a separate expert policy for each environment, and (3) employs safeguards to protect the system's performance. We apply our framework to two systems problems: straggler mitigation and adaptive video streaming, and evaluate it against a variety of alternative approaches using real-world and synthetic data. We show that each component of our framework is necessary to cope with non-stationarity.
    Unsupervised Temporal Video Grounding with Deep Semantic Clustering. (arXiv:2201.05307v1 [cs.CV])
    (0 min) Temporal video grounding (TVG) aims to localize a target segment in a video according to a given sentence query. Though respectable works have made decent achievements in this task, they severely rely on abundant video-query paired data, which is expensive and time-consuming to collect in real-world scenarios. In this paper, we explore whether a video grounding model can be learned without any paired annotations. To the best of our knowledge, this paper is the first work trying to address TVG in an unsupervised setting. Considering there is no paired supervision, we propose a novel Deep Semantic Clustering Network (DSCNet) to leverage all semantic information from the whole query set to compose the possible activity in each video for grounding. Specifically, we first develop a language semantic mining module, which extracts implicit semantic features from the whole query set. Then, these language semantic features serve as the guidance to compose the activity in video via a video-based semantic aggregation module. Finally, we utilize a foreground attention branch to filter out the redundant background activities and refine the grounding results. To validate the effectiveness of our DSCNet, we conduct experiments on both ActivityNet Captions and Charades-STA datasets. The results demonstrate that DSCNet achieves competitive performance, and even outperforms most weakly-supervised approaches.
    A causal view on compositional data. (arXiv:2106.11234v2 [cs.LG] UPDATED)
    (0 min) Many scientific datasets are compositional in nature. Important examples include species abundances in ecology, rock compositions in geology, topic compositions in large-scale text corpora, and sequencing count data in molecular biology. Here, we provide a causal view on compositional data in an instrumental variable setting where the composition acts as the cause. First, we crisply articulate potential pitfalls for practitioners regarding the interpretation of compositional causes from the viewpoint of interventions and warn against attributing causal meaning to common summary statistics such as diversity indices. We then advocate for and develop multivariate methods using statistical data transformations and regression techniques that take the special structure of the compositional sample space into account. In a comparative analysis on synthetic and real data we show the advantages and limitations of our proposal. We posit that our framework provides a useful starting point and guidance for valid and informative cause-effect estimation in the context of compositional data.
    A Kernel-Expanded Stochastic Neural Network. (arXiv:2201.05319v1 [stat.ML])
    (0 min) The deep neural network suffers from many fundamental issues in machine learning. For example, it often gets trapped into a local minimum in training, and its prediction uncertainty is hard to be assessed. To address these issues, we propose the so-called kernel-expanded stochastic neural network (K-StoNet) model, which incorporates support vector regression (SVR) as the first hidden layer and reformulates the neural network as a latent variable model. The former maps the input vector into an infinite dimensional feature space via a radial basis function (RBF) kernel, ensuring absence of local minima on its training loss surface. The latter breaks the high-dimensional nonconvex neural network training problem into a series of low-dimensional convex optimization problems, and enables its prediction uncertainty easily assessed. The K-StoNet can be easily trained using the imputation-regularized optimization (IRO) algorithm. Compared to traditional deep neural networks, K-StoNet possesses a theoretical guarantee to asymptotically converge to the global optimum and enables the prediction uncertainty easily assessed. The performances of the new model in training, prediction and uncertainty quantification are illustrated by simulated and real data examples.
    Examining and Mitigating the Impact of Crossbar Non-idealities for Accurate Implementation of Sparse Deep Neural Networks. (arXiv:2201.05229v1 [cs.LG])
    (0 min) Recently several structured pruning techniques have been introduced for energy-efficient implementation of Deep Neural Networks (DNNs) with lesser number of crossbars. Although, these techniques have claimed to preserve the accuracy of the sparse DNNs on crossbars, none have studied the impact of the inexorable crossbar non-idealities on the actual performance of the pruned networks. To this end, we perform a comprehensive study to show how highly sparse DNNs, that result in significant crossbar-compression-rate, can lead to severe accuracy losses compared to unpruned DNNs mapped onto non-ideal crossbars. We perform experiments with multiple structured-pruning approaches (such as, C/F pruning, XCS and XRS) on VGG11 and VGG16 DNNs with benchmark datasets (CIFAR10 and CIFAR100). We propose two mitigation approaches - Crossbar column rearrangement and Weight-Constrained-Training (WCT) - that can be integrated with the crossbar-mapping of the sparse DNNs to minimize accuracy losses incurred by the pruned models. These help in mitigating non-idealities by increasing the proportion of low conductance synapses on crossbars, thereby improving their computational accuracies.
    Importance sampling based active learning for parametric seismic fragility curve estimation. (arXiv:2109.04323v2 [stat.ML] UPDATED)
    (0 min) The key elements of seismic probabilistic risk assessment studies are the fragility curves which express the probabilities of failure of structures conditional to a seismic intensity measure. A multitude of procedures is currently available to estimate these curves. For modeling-based approaches which may involve complex and expensive numerical models, the main challenge is to optimize the calls to the numerical codes to reduce the estimation costs. Adaptive techniques can be used for this purpose, but in doing so, taking into account the uncertainties of the estimates (via confidence intervals or ellipsoids related to the size of the samples used) is an arduous task because the samples are no longer independent and possibly not identically distributed. The main contribution of this work is to deal with this question in a mathematical and rigorous way. To this end, we propose and implement an active learning methodology based on adaptive importance sampling for parametric estimations of fragility curves. We prove some theoretical properties (consistency and asymptotic normality) for the estimator of interest. Moreover, we give a convergence criterion in order to use asymptotic confidence ellipsoids. Finally, the performances of the methodology are evaluated on analytical and industrial test cases of increasing complexity.
    SPLDExtraTrees: Robust machine learning approach for predicting kinase inhibitor resistance. (arXiv:2111.08008v4 [q-bio.QM] UPDATED)
    (0 min) Drug resistance is a major threat to the global health and a significant concern throughout the clinical treatment of diseases and drug development. The mutation in proteins that is related to drug binding is a common cause for adaptive drug resistance. Therefore, quantitative estimations of how mutations would affect the interaction between a drug and the target protein would be of vital significance for the drug development and the clinical practice. Computational methods that rely on molecular dynamics simulations, Rosetta protocols, as well as machine learning methods have been proven to be capable of predicting ligand affinity changes upon protein mutation. However, the severely limited sample size and heavy noise induced overfitting and generalization issues have impeded wide adoption of machine learning for studying drug resistance. In this paper, we propose a robust machine learning method, termed SPLDExtraTrees, which can accurately predict ligand binding affinity changes upon protein mutation and identify resistance-causing mutations. Especially, the proposed method ranks training data following a specific scheme that starts with easy-to-learn samples and gradually incorporates harder and diverse samples into the training, and then iterates between sample weight recalculations and model updates. In addition, we calculate additional physics-based structural features to provide the machine learning model with the valuable domain knowledge on proteins for this data-limited predictive tasks. The experiments substantiate the capability of the proposed method for predicting kinase inhibitor resistance under three scenarios, and achieves predictive accuracy comparable to that of molecular dynamics and Rosetta methods with much less computational costs.
    Comparing Model-free and Model-based Algorithms for Offline Reinforcement Learning. (arXiv:2201.05433v1 [cs.LG])
    (0 min) Offline reinforcement learning (RL) Algorithms are often designed with environments such as MuJoCo in mind, in which the planning horizon is extremely long and no noise exists. We compare model-free, model-based, as well as hybrid offline RL approaches on various industrial benchmark (IB) datasets to test the algorithms in settings closer to real world problems, including complex noise and partially observable states. We find that on the IB, hybrid approaches face severe difficulties and that simpler algorithms, such as rollout based algorithms or model-free algorithms with simpler regularizers perform best on the datasets.
    Combining No-regret and Q-learning. (arXiv:1910.03094v3 [cs.LG] UPDATED)
    (0 min) Counterfactual Regret Minimization (CFR) has found success in settings like poker which have both terminal states and perfect recall. We seek to understand how to relax these requirements. As a first step, we introduce a simple algorithm, local no-regret learning (LONR), which uses a Q-learning-like update rule to allow learning without terminal states or perfect recall. We prove its convergence for the basic case of MDPs (and limited extensions of them) and present empirical results showing that it achieves last iterate convergence in a number of settings, most notably NoSDE games, a class of Markov games specifically designed to be challenging to learn where no prior algorithm is known to achieve convergence to a stationary equilibrium even on average.
    PDO-eS2CNNs: Partial Differential Operator Based Equivariant Spherical CNNs. (arXiv:2104.03584v2 [cs.CV] UPDATED)
    (0 min) Spherical signals exist in many applications, e.g., planetary data, LiDAR scans and digitalization of 3D objects, calling for models that can process spherical data effectively. It does not perform well when simply projecting spherical data into the 2D plane and then using planar convolution neural networks (CNNs), because of the distortion from projection and ineffective translation equivariance. Actually, good principles of designing spherical CNNs are avoiding distortions and converting the shift equivariance property in planar CNNs to rotation equivariance in the spherical domain. In this work, we use partial differential operators (PDOs) to design a spherical equivariant CNN, PDO-eS2CNN, which is exactly rotation equivariant in the continuous domain. We then discretize PDO-eS2CNNs, and analyze the equivariance error resulted from discretization. This is the first time that the equivariance error is theoretically analyzed in the spherical domain. In experiments, PDO-eS2CNNs show greater parameter efficiency and outperform other spherical CNNs significantly on several tasks.
    Generic Itemset Mining Based on Reinforcement Learning. (arXiv:2105.07753v2 [cs.DB] UPDATED)
    (0 min) One of the biggest problems in itemset mining is the requirement of developing a data structure or algorithm, every time a user wants to extract a different type of itemsets. To overcome this, we propose a method, called Generic Itemset Mining based on Reinforcement Learning (GIM-RL), that offers a unified framework to train an agent for extracting any type of itemsets. In GIM-RL, the environment formulates iterative steps of extracting a target type of itemsets from a dataset. At each step, an agent performs an action to add or remove an item to or from the current itemset, and then obtains from the environment a reward that represents how relevant the itemset resulting from the action is to the target type. Through numerous trial-and-error steps where various rewards are obtained by diverse actions, the agent is trained to maximise cumulative rewards so that it acquires the optimal action policy for forming as many itemsets of the target type as possible. In this framework, an agent for extracting any type of itemsets can be trained as long as a reward suitable for the type can be defined. The extensive experiments on mining high utility itemsets, frequent itemsets and association rules show the general effectiveness and one remarkable potential (agent transfer) of GIM-RL. We hope that GIM-RL opens a new research direction towards learning-based itemset mining.
    HYLDA: End-to-end Hybrid Learning Domain Adaptation for LiDAR Semantic Segmentation. (arXiv:2201.05585v1 [cs.CV])
    (0 min) In this paper we address the problem of training a LiDAR semantic segmentation network using a fully-labeled source dataset and a target dataset that only has a small number of labels. To this end, we develop a novel image-to-image translation engine, and couple it with a LiDAR semantic segmentation network, resulting in an integrated domain adaptation architecture we call HYLDA. To train the system end-to-end, we adopt a diverse set of learning paradigms, including 1) self-supervision on a simple auxiliary reconstruction task, 2) semi-supervised training using a few available labeled target domain frames, and 3) unsupervised training on the fake translated images generated by the image-to-image translation stage, together with the labeled frames from the source domain. In the latter case, the semantic segmentation network participates in the updating of the image-to-image translation engine. We demonstrate experimentally that HYLDA effectively addresses the challenging problem of improving generalization on validation data from the target domain when only a few target labeled frames are available for training. We perform an extensive evaluation where we compare HYLDA against strong baseline methods using two publicly available LiDAR semantic segmentation datasets.
    How I learned to stop worrying and love the curse of dimensionality: an appraisal of cluster validation in high-dimensional spaces. (arXiv:2201.05214v1 [cs.LG])
    (0 min) The failure of the Euclidean norm to reliably distinguish between nearby and distant points in high dimensional space is well-known. This phenomenon of distance concentration manifests in a variety of data distributions, with iid or correlated features, including centrally-distributed and clustered data. Unsupervised learning based on Euclidean nearest-neighbors and more general proximity-oriented data mining tasks like clustering, might therefore be adversely affected by distance concentration for high-dimensional applications. While considerable work has been done developing clustering algorithms with reliable high-dimensional performance, the problem of cluster validation--of determining the natural number of clusters in a dataset--has not been carefully examined in high-dimensional problems. In this work we investigate how the sensitivities of common Euclidean norm-based cluster validity indices scale with dimension for a variety of synthetic data schemes, including well-separated and noisy clusters, and find that the overwhelming majority of indices have improved or stable sensitivity in high dimensions. The curse of dimensionality is therefore dispelled for this class of fairly generic data schemes.
    Voice2Series: Reprogramming Acoustic Models for Time Series Classification. (arXiv:2106.09296v3 [cs.LG] UPDATED)
    (2 min) Learning to classify time series with limited data is a practical yet challenging problem. Current methods are primarily based on hand-designed feature extraction rules or domain-specific data augmentation. Motivated by the advances in deep speech processing models and the fact that voice data are univariate temporal signals, in this paper, we propose Voice2Series (V2S), a novel end-to-end approach that reprograms acoustic models for time series classification, through input transformation learning and output label mapping. Leveraging the representation learning power of a large-scale pre-trained speech processing model, on 30 different time series tasks we show that V2S performs competitive results on 19 time series classification tasks. We further provide a theoretical justification of V2S by proving its population risk is upper bounded by the source risk and a Wasserstein distance accounting for feature alignment via reprogramming. Our results offer new and effective means to time series classification.
    Parallel Neural Local Lossless Compression. (arXiv:2201.05213v1 [eess.IV])
    (2 min) The recently proposed Neural Local Lossless Compression (NeLLoC), which is based on a local autoregressive model, has achieved state-of-the-art (SOTA) out-of-distribution (OOD) generalization performance in the image compression task. In addition to the encouragement of OOD generalization, the local model also allows parallel inference in the decoding stage. In this paper, we propose a parallelization scheme for local autoregressive models. We discuss the practicalities of implementing this scheme, and provide experimental evidence of significant gains in compression runtime compared to the previous, non-parallel implementation.
    Adaptive Transfer Learning for Plant Phenotyping. (arXiv:2201.05261v1 [cs.LG])
    (2 min) Plant phenotyping (Guo et al. 2021; Pieruschka et al. 2019) focuses on studying the diverse traits of plants related to the plants' growth. To be more specific, by accurately measuring the plant's anatomical, ontogenetical, physiological and biochemical properties, it allows identifying the crucial factors of plants' growth in different environments. One commonly used approach is to predict the plant's traits using hyperspectral reflectance (Yendrek et al. 2017; Wang et al. 2021). However, the data distributions of the hyperspectral reflectance data in plant phenotyping might vary in different environments for different plants. That is, it would be computationally expansive to learn the machine learning models separately for one plant in different environments. To solve this problem, we focus on studying the knowledge transferability of modern machine learning models in plant phenotyping. More specifically, this work aims to answer the following questions. (1) How is the performance of conventional machine learning models, e.g., partial least squares regression (PLSR), Gaussian process regression (GPR) and multi-layer perceptron (MLP), affected by the number of annotated samples for plant phenotyping? (2) Whether could the neural network based transfer learning models improve the performance of plant phenotyping? (3) Could the neural network based transfer learning be improved by using infinite-width hidden layers for plant phenotyping?
    A Survey of Knowledge-Enhanced Text Generation. (arXiv:2010.04389v3 [cs.CL] UPDATED)
    (2 min) The goal of text generation is to make machines express in human language. It is one of the most important yet challenging tasks in natural language processing (NLP). Since 2014, various neural encoder-decoder models pioneered by Seq2Seq have been proposed to achieve the goal by learning to map input text to output text. However, the input text alone often provides limited knowledge to generate the desired output, so the performance of text generation is still far from satisfaction in many real-world scenarios. To address this issue, researchers have considered incorporating various forms of knowledge beyond the input text into the generation models. This research direction is known as knowledge-enhanced text generation. In this survey, we present a comprehensive review of the research on knowledge enhanced text generation over the past five years. The main content includes two parts: (i) general methods and architectures for integrating knowledge into text generation; (ii) specific techniques and applications according to different forms of knowledge data. This survey can have broad audiences, researchers and practitioners, in academia and industry.
    De Rham compatible Deep Neural Networks. (arXiv:2201.05395v1 [math.NA])
    (2 min) We construct several classes of neural networks with ReLU and BiSU (Binary Step Unit) activations, which exactly emulate the lowest order Finite Element (FE) spaces on regular, simplicial partitions of polygonal and polyhedral domains $\Omega \subset \mathbb{R}^d$, $d=2,3$. For continuous, piecewise linear (CPwL) functions, our constructions generalize previous results in that arbitrary, regular simplicial partitions of $\Omega$ are admitted, also in arbitrary dimension $d\geq 2$. Vector-valued elements emulated include the classical Raviart-Thomas and the first family of N\'{e}d\'{e}lec edge elements on triangles and tetrahedra. Neural Networks emulating these FE spaces are required in the correct approximation of boundary value problems of electromagnetism in nonconvex polyhedra $\Omega \subset \mathbb{R}^3$, thereby constituting an essential ingredient in the application of e.g. the methodology of ``physics-informed NNs'' or ``deep Ritz methods'' to electromagnetic field simulation via deep learning techniques. They satisfy exact (De Rham) sequence properties, and also spawn discrete boundary complexes on $\partial\Omega$ which satisfy exact sequence properties for the surface divergence and curl operators $\mathrm{div}_\Gamma$ and $\mathrm{curl}_\Gamma$, respectively, thereby enabling ``neural boundary elements'' for computational electromagnetism. We indicate generalizations of our constructions to higher-order compatible spaces and other, non-compatible classes of discretizations in particular the Crouzeix-Raviart elements and Hybridized, Higher Order (HHO) methods.
    DapStep: Deep Assignee Prediction for Stack Trace Error rePresentation. (arXiv:2201.05256v1 [cs.SE])
    (2 min) The task of finding the best developer to fix a bug is called bug triage. Most of the existing approaches consider the bug triage task as a classification problem, however, classification is not appropriate when the sets of classes change over time (as developers often do in a project). Furthermore, to the best of our knowledge, all the existing models use textual sources of information, i.e., bug descriptions, which are not always available. In this work, we explore the applicability of existing solutions for the bug triage problem when stack traces are used as the main data source of bug reports. Additionally, we reformulate this task as a ranking problem and propose new deep learning models to solve it. The models are based on a bidirectional recurrent neural network with attention and on a convolutional neural network, with the weights of the models optimized using a ranking loss function. To improve the quality of ranking, we propose using additional information from version control system annotations. Two approaches are proposed for extracting features from annotations: manual and using an additional neural network. To evaluate our models, we collected two datasets of real-world stack traces. Our experiments show that the proposed models outperform existing models adapted to handle stack traces. To facilitate further research in this area, we publish the source code of our models and one of the collected datasets.
    Position-enhanced and Time-aware Graph Convolutional Network for Sequential Recommendations. (arXiv:2107.05235v2 [cs.IR] UPDATED)
    (2 min) Most of the existing deep learning-based sequential recommendation approaches utilize the recurrent neural network architecture or self-attention to model the sequential patterns and temporal influence among a user's historical behavior and learn the user's preference at a specific time. However, these methods have two main drawbacks. First, they focus on modeling users' dynamic states from a user-centric perspective and always neglect the dynamics of items over time. Second, most of them deal with only the first-order user-item interactions and do not consider the high-order connectivity between users and items, which has recently been proved helpful for the sequential recommendation. To address the above problems, in this article, we attempt to model user-item interactions by a bipartite graph structure and propose a new recommendation approach based on a Position-enhanced and Time-aware Graph Convolutional Network (PTGCN) for the sequential recommendation. PTGCN models the sequential patterns and temporal dynamics between user-item interactions by defining a position-enhanced and time-aware graph convolution operation and learning the dynamic representations of users and items simultaneously on the bipartite graph with a self-attention aggregator. Also, it realizes the high-order connectivity between users and items by stacking multi-layer graph convolutions. To demonstrate the effectiveness of PTGCN, we carried out a comprehensive evaluation of PTGCN on three real-world datasets of different sizes compared with a few competitive baselines. Experimental results indicate that PTGCN outperforms several state-of-the-art models in terms of two commonly-used evaluation metrics for ranking.
    Training Free Graph Neural Networks for Graph Matching. (arXiv:2201.05349v1 [cs.LG])
    (2 min) We present TFGM (Training Free Graph Matching), a framework to boost the performance of Graph Neural Networks (GNNs) based graph matching without training. TFGM sidesteps two crucial problems when training GNNs: 1) the limited supervision due to expensive annotation, and 2) training's computational cost. A basic framework, BasicTFGM, is first proposed by adopting the inference stage of graph matching methods. Our analysis shows that the BasicTFGM is a linear relaxation to the quadratic assignment formulation of graph matching. This guarantees the preservation of structure compatibility and an efficient polynomial complexity. Empirically, we further improve the BasicTFGM by handcrafting two types of matching priors into the architecture of GNNs: comparing node neighborhoods of different localities and utilizing annotation data if available. For evaluation, we conduct extensive experiments on a broad set of settings, including supervised keypoint matching between images, semi-supervised entity alignment between knowledge graphs, and unsupervised alignment between protein interaction networks. Applying TFGM on various GNNs shows promising improvements over baselines. Further ablation studies demonstrate the effective and efficient training-free property of TFGM. Our code is available at https://github.com/acharkq/Training-Free-Graph-Matching.
    A causal model of safety assurance for machine learning. (arXiv:2201.05451v1 [cs.SE])
    (2 min) This paper proposes a framework based on a causal model of safety upon which effective safety assurance cases for ML-based applications can be built. In doing so, we build upon established principles of safety engineering as well as previous work on structuring assurance arguments for ML. The paper defines four categories of safety case evidence and a structured analysis approach within which these evidences can be effectively combined. Where appropriate, abstract formalisations of these contributions are used to illustrate the causalities they evaluate, their contributions to the safety argument and desirable properties of the evidences. Based on the proposed framework, progress in this area is re-evaluated and a set of future research directions proposed in order for tangible progress in this field to be made.
    Leveraging Intrinsic Gradient Information for Further Training of Differentiable Machine Learning Models. (arXiv:2112.00094v2 [cs.LG] UPDATED)
    (2 min) Designing models that produce accurate predictions is the fundamental objective of machine learning (ML). This work presents methods demonstrating that when the derivatives of target variables (outputs) with respect to inputs can be extracted from processes of interest, e.g., neural networks (NN) based surrogate models, they can be leveraged to further improve the accuracy of differentiable ML models. This paper generalises the idea and provides practical methodologies that can be used to leverage gradient information (GI) across a variety of applications including: (1) Improving the performance of generative adversarial networks (GANs); (2) efficiently tuning NN model complexity; (3) regularising linear regressions. Numerical results show that GI can effective enhance ML models with existing datasets, demonstrating its value for a variety of applications.
    LeapfrogLayers: A Trainable Framework for Effective Topological Sampling. (arXiv:2112.01582v2 [hep-lat] UPDATED)
    (2 min) We introduce LeapfrogLayers, an invertible neural network architecture that can be trained to efficiently sample the topology of a 2D $U(1)$ lattice gauge theory. We show an improvement in the integrated autocorrelation time of the topological charge when compared with traditional HMC, and look at how different quantities transform under our model. Our implementation is open source, and is publicly available on github at https://github.com/saforem2/l2hmc-qcd.
    Cross-token Modeling with Conditional Computation. (arXiv:2109.02008v3 [cs.LG] UPDATED)
    (2 min) Mixture-of-Experts (MoE), a conditional computation architecture, achieved promising performance by scaling local module (i.e. feed-forward network) of transformer. However, scaling the cross-token module (i.e. self-attention) is challenging due to the unstable training. This work proposes Sparse-MLP, an all-MLP model which applies sparsely-activated MLPs to cross-token modeling. Specifically, in each Sparse block of our all-MLP model, we apply two stages of MoE layers: one with MLP experts mixing information within channels along image patch dimension, the other with MLP experts mixing information within patches along the channel dimension. In addition, by proposing importance-score routing strategy for MoE and redesigning the image representation shape, we further improve our model's computational efficiency. Experimentally, we are more computation-efficient than Vision Transformers with comparable accuracy. Also, our models can outperform MLP-Mixer by 2.5\% on ImageNet Top-1 accuracy with fewer parameters and computational cost. On downstream tasks, i.e. Cifar10 and Cifar100, our models can still achieve better performance than baselines.
    Estimating Gaussian Copulas with Missing Data. (arXiv:2201.05565v1 [stat.ML])
    (2 min) In this work we present a rigorous application of the Expectation Maximization algorithm to determine the marginal distributions and the dependence structure in a Gaussian copula model with missing data. We further show how to circumvent a priori assumptions on the marginals with semiparametric modelling. The joint distribution learned through this algorithm is considerably closer to the underlying distribution than existing methods.
    SympOCnet: Solving optimal control problems with applications to high-dimensional multi-agent path planning problems. (arXiv:2201.05475v1 [math.OC])
    (2 min) Solving high-dimensional optimal control problems in real-time is an important but challenging problem, with applications to multi-agent path planning problems, which have drawn increased attention given the growing popularity of drones in recent years. In this paper, we propose a novel neural network method called SympOCnet that applies the Symplectic network to solve high-dimensional optimal control problems with state constraints. We present several numerical results on path planning problems in two-dimensional and three-dimensional spaces. Specifically, we demonstrate that our SympOCnet can solve a problem with more than 500 dimensions in 1.5 hours on a single GPU, which shows the effectiveness and efficiency of SympOCnet. The proposed method is scalable and has the potential to solve truly high-dimensional path planning problems in real-time.
    Manifoldron: Direct Space Partition via Manifold Discovery. (arXiv:2201.05279v1 [cs.LG])
    (2 min) A neural network with the widely-used ReLU activation has been shown to partition the sample space into many convex polytopes for prediction. However, the parameterized way a neural network and other machine learning models use to partition the space has imperfections, e.g., the compromised interpretability for complex models, the inflexibility in decision boundary construction due to the generic character of the model, and the risk of being trapped into shortcut solutions. In contrast, although the non-parameterized models can adorably avoid or downplay these issues, they are usually insufficiently powerful either due to over-simplification or the failure to accommodate the manifold structures of data. In this context, we first propose a new type of machine learning models referred to as Manifoldron that directly derives decision boundaries from data and partitions the space via manifold structure discovery. Then, we systematically analyze the key characteristics of the Manifoldron including interpretability, manifold characterization capability, and its link to neural networks. The experimental results on 9 small and 11 large datasets demonstrate that the proposed Manifoldron performs competitively compared to the mainstream machine learning models. We have shared our code https://github.com/wdayang/Manifoldron for free download and evaluation.
    N-Cloth: Predicting 3D Cloth Deformation with Mesh-Based Networks. (arXiv:2112.06397v2 [cs.GR] UPDATED)
    (2 min) We present a novel mesh-based learning approach (N-Cloth) for plausible 3D cloth deformation prediction. Our approach is general and can handle cloth or obstacles represented by triangle meshes with arbitrary topologies. We use graph convolution to transform the cloth and object meshes into a latent space to reduce the non-linearity in the mesh space. Our network can predict the target 3D cloth mesh deformation based on the initial state of the cloth mesh template and the target obstacle mesh. Our approach can handle complex cloth meshes with up to 100K triangles and scenes with various objects corresponding to SMPL humans, non-SMPL humans or rigid bodies. In practice, our approach can be used to generate plausible cloth simulation at 30-45 fps on an NVIDIA GeForce RTX 3090 GPU. We highlight its benefits over prior learning-based methods and physically-based cloth simulators.
    Multi-head Temporal Attention-Augmented Bilinear Network for Financial time series prediction. (arXiv:2201.05459v1 [cs.LG])
    (2 min) Financial time-series forecasting is one of the most challenging domains in the field of time-series analysis. This is mostly due to the highly non-stationary and noisy nature of financial time-series data. With progressive efforts of the community to design specialized neural networks incorporating prior domain knowledge, many financial analysis and forecasting problems have been successfully tackled. The temporal attention mechanism is a neural layer design that recently gained popularity due to its ability to focus on important temporal events. In this paper, we propose a neural layer based on the ideas of temporal attention and multi-head attention to extend the capability of the underlying neural network in focusing simultaneously on multiple temporal instances. The effectiveness of our approach is validated using large-scale limit-order book market data to forecast the direction of mid-price movements. Our experiments show that the use of multi-head temporal attention modules leads to enhanced prediction performances compared to baseline models.
    Attention-Based Recommendation On Graphs. (arXiv:2201.05499v1 [cs.IR])
    (2 min) Graph Neural Networks (GNN) have shown remarkable performance in different tasks. However, there are a few studies about GNN on recommender systems. GCN as a type of GNNs can extract high-quality embeddings for different entities in a graph. In a collaborative filtering task, the core problem is to find out how informative an entity would be for predicting the future behavior of a target user. Using an attention mechanism, we can enable GCNs to do such an analysis when the underlying data is modeled as a graph. In this study, we proposed GARec as a model-based recommender system that applies an attention mechanism along with a spatial GCN on a recommender graph to extract embeddings for users and items. The attention mechanism tells GCN how much a related user or item should affect the final representation of the target entity. We compared the performance of GARec against some baseline algorithms in terms of RMSE. The presented method outperforms existing model-based, non-graph neural networks and graph neural networks in different MovieLens datasets.
    Machine Learning: The Basics. (arXiv:1805.05052v16 [cs.LG] UPDATED)
    (3 min) Machine learning (ML) has become a commodity in our every-day lives. We routinely ask ML empowered smartphones to suggest lovely food places or to guide us through a strange place. ML methods have also become standard tools in many fields of science and engineering. A plethora of ML applications transform human lives at unprecedented pace and scale. This book portrays ML as the combination of three basic components: data, model and loss. ML methods combine these three components within computationally efficient implementations of the basic scientific principle "trial and error". This principle consists of the continuous adaptation of a hypothesis about a phenomenon that generates data. ML methods use a hypothesis to compute predictions for future events. We believe that thinking about ML as combinations of three components given by data, model, and loss helps to navigate the steadily growing offer for ready-to-use ML methods. Our three-component picture of ML allows a unified treatment of a wide range of concepts and techniques which seem quite unrelated at first sight. The regularization effect of early stopping in iterative methods is due to the shrinking of the effective hypothesis space. Privacy-preserving ML is obtained by particular choices for the features of data points. Explainable ML methods are characterized by particular choices for the hypothesis space. To make good use of ML tools it is instrumental to understand its underlying principles at different levels of detail. On a lower level, this tutorial helps ML engineers to choose suitable methods for the application at hand. The book also offers a higher-level view on the implementation of ML methods which is typically required to manage a team of ML engineers and data scientists.
    Eikonal depth: an optimal control approach to statistical depths. (arXiv:2201.05274v1 [math.ST])
    (2 min) Statistical depths provide a fundamental generalization of quantiles and medians to data in higher dimensions. This paper proposes a new type of globally defined statistical depth, based upon control theory and eikonal equations, which measures the smallest amount of probability density that has to be passed through in a path to points outside the support of the distribution: for example spatial infinity. This depth is easy to interpret and compute, expressively captures multi-modal behavior, and extends naturally to data that is non-Euclidean. We prove various properties of this depth, and provide discussion of computational considerations. In particular, we demonstrate that this notion of depth is robust under an aproximate isometrically constrained adversarial model, a property which is not enjoyed by the Tukey depth. Finally we give some illustrative examples in the context of two-dimensional mixture models and MNIST.
    Recombinator-k-means: An evolutionary algorithm that exploits k-means++ for recombination. (arXiv:1905.00531v5 [cs.LG] UPDATED)
    (2 min) We introduce an evolutionary algorithm called recombinator-$k$-means for optimizing the highly non-convex kmeans problem. Its defining feature is that its crossover step involves all the members of the current generation, stochastically recombining them with a repurposed variant of the $k$-means++ seeding algorithm. The recombination also uses a reweighting mechanism that realizes a progressively sharper stochastic selection policy and ensures that the population eventually coalesces into a single solution. We compare this scheme with state-of-the-art alternative, a more standard genetic algorithm with deterministic pairwise-nearest-neighbor crossover and an elitist selection policy, of which we also provide an augmented and efficient implementation. Extensive tests on large and challenging datasets (both synthetic and real-word) show that for fixed population sizes recombinator-$k$-means is generally superior in terms of the optimization objective, at the cost of a more expensive crossover step. When adjusting the population sizes of the two algorithms to match their running times, we find that for short times the (augmented) pairwise-nearest-neighbor method is always superior, while at longer times recombinator-$k$-means will match it and, on the most difficult examples, take over. We conclude that the reweighted whole-population recombination is more costly, but generally better at escaping local minima. Moreover, it is algorithmically simpler and more general (it could be applied even to $k$-medians or $k$-medoids, for example). Our implementations are publicly available at \href{https://github.com/carlobaldassi/RecombinatorKMeans.jl}{https://github.com/carlobaldassi/RecombinatorKMeans.jl}.
    On the Estimation Bias in Double Q-Learning. (arXiv:2109.14419v3 [cs.LG] UPDATED)
    (2 min) Double Q-learning is a classical method for reducing overestimation bias, which is caused by taking maximum estimated values in the Bellman operation. Its variants in the deep Q-learning paradigm have shown great promise in producing reliable value prediction and improving learning performance. However, as shown by prior work, double Q-learning is not fully unbiased and suffers from underestimation bias. In this paper, we show that such underestimation bias may lead to multiple non-optimal fixed points under an approximate Bellman operator. To address the concerns of converging to non-optimal stationary solutions, we propose a simple but effective approach as a partial fix for the underestimation bias in double Q-learning. This approach leverages an approximate dynamic programming to bound the target value. We extensively evaluate our proposed method in the Atari benchmark tasks and demonstrate its significant improvement over baseline algorithms.
    Contrastive Laplacian Eigenmaps. (arXiv:2201.05493v1 [cs.LG])
    (2 min) Graph contrastive learning attracts/disperses node representations for similar/dissimilar node pairs under some notion of similarity. It may be combined with a low-dimensional embedding of nodes to preserve intrinsic and structural properties of a graph. In this paper, we extend the celebrated Laplacian Eigenmaps with contrastive learning, and call them COntrastive Laplacian EigenmapS (COLES). Starting from a GAN-inspired contrastive formulation, we show that the Jensen-Shannon divergence underlying many contrastive graph embedding models fails under disjoint positive and negative distributions, which may naturally emerge during sampling in the contrastive setting. In contrast, we demonstrate analytically that COLES essentially minimizes a surrogate of Wasserstein distance, which is known to cope well under disjoint distributions. Moreover, we show that the loss of COLES belongs to the family of so-called block-contrastive losses, previously shown to be superior compared to pair-wise losses typically used by contrastive methods. We show on popular benchmarks/backbones that COLES offers favourable accuracy/scalability compared to DeepWalk, GCN, Graph2Gauss, DGI and GRACE baselines.
    Structure Enhanced Graph Neural Networks for Link Prediction. (arXiv:2201.05293v1 [cs.LG])
    (2 min) Graph Neural Networks (GNNs) have shown promising results in various tasks, among which link prediction is an important one. GNN models usually follow a node-centric message passing procedure that aggregates the neighborhood information to the central node recursively. Following this paradigm, features of nodes are passed through edges without caring about where the nodes are located and which role they played. However, the neglected topological information is shown to be valuable for link prediction tasks. In this paper, we propose Structure Enhanced Graph neural network (SEG) for link prediction. SEG introduces the path labeling method to capture surrounding topological information of target nodes and then incorporates the structure into an ordinary GNN model. By jointly training the structure encoder and deep GNN model, SEG fuses topological structures and node features to take full advantage of graph information. Experiments on the OGB link prediction datasets demonstrate that SEG achieves state-of-the-art results among all three public datasets.
    The Fairness Field Guide: Perspectives from Social and Formal Sciences. (arXiv:2201.05216v1 [cs.AI])
    (2 min) Over the past several years, a slew of different methods to measure the fairness of a machine learning model have been proposed. However, despite the growing number of publications and implementations, there is still a critical lack of literature that explains the interplay of fair machine learning with the social sciences of philosophy, sociology, and law. We hope to remedy this issue by accumulating and expounding upon the thoughts and discussions of fair machine learning produced by both social and formal (specifically machine learning and statistics) sciences in this field guide. Specifically, in addition to giving the mathematical and algorithmic backgrounds of several popular statistical and causal-based fair machine learning methods, we explain the underlying philosophical and legal thoughts that support them. Further, we explore several criticisms of the current approaches to fair machine learning from sociological and philosophical viewpoints. It is our hope that this field guide will help fair machine learning practitioners better understand how their algorithms align with important humanistic values (such as fairness) and how we can, as a field, design methods and metrics to better serve oppressed and marginalized populaces.
  • stat.ML updates on arXiv.org

    A Method for Controlling Extrapolation when Visualizing and Optimizing the Prediction Profiles of Statistical and Machine Learning Models. (arXiv:2201.05236v1 [stat.ML])
    (2 min) We present a novel method for controlling extrapolation in the prediction profiler in the JMP software. The prediction profiler is a graphical tool for exploring high dimensional prediction surfaces for statistical and machine learning models. The profiler contains interactive cross-sectional views, or profile traces, of the prediction surface of a model. Our method helps users avoid exploring predictions that should be considered extrapolation. It also performs optimization over a constrained factor region that avoids extrapolation using a genetic algorithm. In simulations and real world examples, we demonstrate how optimal factor settings without constraint in the profiler are frequently extrapolated, and how extrapolation control helps avoid these solutions with invalid factor settings that may not be useful to the user.
    The Implicit Regularization of Momentum Gradient Descent with Early Stopping. (arXiv:2201.05405v1 [cs.LG])
    (0 min) The study on the implicit regularization induced by gradient-based optimization is a longstanding pursuit. In the present paper, we characterize the implicit regularization of momentum gradient descent (MGD) with early stopping by comparing with the explicit $\ell_2$-regularization (ridge). In details, we study MGD in the continuous-time view, so-called momentum gradient flow (MGF), and show that its tendency is closer to ridge than the gradient descent (GD) [Ali et al., 2019] for least squares regression. Moreover, we prove that, under the calibration $t=\sqrt{2/\lambda}$, where $t$ is the time parameter in MGF and $\lambda$ is the tuning parameter in ridge regression, the risk of MGF is no more than 1.54 times that of ridge. In particular, the relative Bayes risk of MGF to ridge is between 1 and 1.035 under the optimal tuning. The numerical experiments support our theoretical results strongly.
    On Preference Learning Based on Sequential Bayesian Optimization with Pairwise Comparison. (arXiv:2103.13192v2 [cs.LG] UPDATED)
    (0 min) User preference learning is generally a hard problem. Individual preferences are typically unknown even to users themselves, while the space of choices is infinite. Here we study user preference learning from information-theoretic perspective. We model preference learning as a system with two interacting sub-systems, one representing a user with his/her preferences and another one representing an agent that has to learn these preferences. The user with his/her behaviour is modeled by a parametric preference function. To efficiently learn the preferences and reduce search space quickly, we propose the agent that interacts with the user to collect the most informative data for learning. The agent presents two proposals to the user for evaluation, and the user rates them based on his/her preference function. We show that the optimum agent strategy for data collection and preference learning is a result of maximin optimization of the normalized weighted Kullback-Leibler (KL) divergence between true and agent-assigned predictive user response distributions. The resulting value of KL-divergence, which we also call remaining system uncertainty (RSU), provides an efficient performance metric in the absence of the ground truth. This metric characterises how well the agent can predict user and, thus, the quality of the underlying learned user (preference) model. Our proposed agent comprises sequential mechanisms for user model inference and proposal generation. To infer the user model (preference function), Bayesian approximate inference is used in the agent. The data collection strategy is to generate proposals, responses to which help resolving uncertainty associated with prediction of the user responses the most. The efficiency of our approach is validated by numerical simulations.
    Generalization Error Bounds for Iterative Recovery Algorithms Unfolded as Neural Networks. (arXiv:2112.04364v2 [cs.LG] UPDATED)
    (2 min) Motivated by the learned iterative soft thresholding algorithm (LISTA), we introduce a general class of neural networks suitable for sparse reconstruction from few linear measurements. By allowing a wide range of degrees of weight-sharing between the layers, we enable a unified analysis for very different neural network types, ranging from recurrent ones to networks more similar to standard feedforward neural networks. Based on training samples, via empirical risk minimization we aim at learning the optimal network parameters and thereby the optimal network that reconstructs signals from their low-dimensional linear measurements. We derive generalization bounds by analyzing the Rademacher complexity of hypothesis classes consisting of such deep networks, that also take into account the thresholding parameters. We obtain estimates of the sample complexity that essentially depend only linearly on the number of parameters and on the depth. We apply our main result to obtain specific generalization bounds for several practical examples, including different algorithms for (implicit) dictionary learning, and convolutional neural networks.
    On the Estimation Bias in Double Q-Learning. (arXiv:2109.14419v3 [cs.LG] UPDATED)
    (2 min) Double Q-learning is a classical method for reducing overestimation bias, which is caused by taking maximum estimated values in the Bellman operation. Its variants in the deep Q-learning paradigm have shown great promise in producing reliable value prediction and improving learning performance. However, as shown by prior work, double Q-learning is not fully unbiased and suffers from underestimation bias. In this paper, we show that such underestimation bias may lead to multiple non-optimal fixed points under an approximate Bellman operator. To address the concerns of converging to non-optimal stationary solutions, we propose a simple but effective approach as a partial fix for the underestimation bias in double Q-learning. This approach leverages an approximate dynamic programming to bound the target value. We extensively evaluate our proposed method in the Atari benchmark tasks and demonstrate its significant improvement over baseline algorithms.
    Understanding Negative Samples in Instance Discriminative Self-supervised Representation Learning. (arXiv:2102.06866v5 [cs.LG] UPDATED)
    (2 min) Instance discriminative self-supervised representation learning has been attracted attention thanks to its unsupervised nature and informative feature representation for downstream tasks. In practice, it commonly uses a larger number of negative samples than the number of supervised classes. However, there is an inconsistency in the existing analysis; theoretically, a large number of negative samples degrade classification performance on a downstream supervised task, while empirically, they improve the performance. We provide a novel framework to analyze this empirical result regarding negative samples using the coupon collector's problem. Our bound can implicitly incorporate the supervised loss of the downstream task in the self-supervised loss by increasing the number of negative samples. We confirm that our proposed analysis holds on real-world benchmark datasets.
    Consistent Approximations in Composite Optimization. (arXiv:2201.05250v1 [math.OC])
    (2 min) Approximations of optimization problems arise in computational procedures and sensitivity analysis. The resulting effect on solutions can be significant, with even small approximations of components of a problem translating into large errors in the solutions. We specify conditions under which approximations are well behaved in the sense of minimizers, stationary points, and level-sets and this leads to a framework of consistent approximations. The framework is developed for a broad class of composite problems, which are neither convex nor smooth. We demonstrate the framework using examples from stochastic optimization, neural-network based machine learning, distributionally robust optimization, penalty and augmented Lagrangian methods, interior-point methods, homotopy methods, smoothing methods, extended nonlinear programming, difference-of-convex programming, and multi-objective optimization. An enhanced proximal method illustrates the algorithmic possibilities. A quantitative analysis supplements the development by furnishing rates of convergence.
    Machine Learning: The Basics. (arXiv:1805.05052v16 [cs.LG] UPDATED)
    (3 min) Machine learning (ML) has become a commodity in our every-day lives. We routinely ask ML empowered smartphones to suggest lovely food places or to guide us through a strange place. ML methods have also become standard tools in many fields of science and engineering. A plethora of ML applications transform human lives at unprecedented pace and scale. This book portrays ML as the combination of three basic components: data, model and loss. ML methods combine these three components within computationally efficient implementations of the basic scientific principle "trial and error". This principle consists of the continuous adaptation of a hypothesis about a phenomenon that generates data. ML methods use a hypothesis to compute predictions for future events. We believe that thinking about ML as combinations of three components given by data, model, and loss helps to navigate the steadily growing offer for ready-to-use ML methods. Our three-component picture of ML allows a unified treatment of a wide range of concepts and techniques which seem quite unrelated at first sight. The regularization effect of early stopping in iterative methods is due to the shrinking of the effective hypothesis space. Privacy-preserving ML is obtained by particular choices for the features of data points. Explainable ML methods are characterized by particular choices for the hypothesis space. To make good use of ML tools it is instrumental to understand its underlying principles at different levels of detail. On a lower level, this tutorial helps ML engineers to choose suitable methods for the application at hand. The book also offers a higher-level view on the implementation of ML methods which is typically required to manage a team of ML engineers and data scientists.
    Self-Supervised Learning with Data Augmentations Provably Isolates Content from Style. (arXiv:2106.04619v4 [stat.ML] UPDATED)
    (2 min) Self-supervised representation learning has shown remarkable success in a number of domains. A common practice is to perform data augmentation via hand-crafted transformations intended to leave the semantics of the data invariant. We seek to understand the empirical success of this approach from a theoretical perspective. We formulate the augmentation process as a latent variable model by postulating a partition of the latent representation into a content component, which is assumed invariant to augmentation, and a style component, which is allowed to change. Unlike prior work on disentanglement and independent component analysis, we allow for both nontrivial statistical and causal dependencies in the latent space. We study the identifiability of the latent representation based on pairs of views of the observations and prove sufficient conditions that allow us to identify the invariant content partition up to an invertible mapping in both generative and discriminative settings. We find numerical simulations with dependent latent variables are consistent with our theory. Lastly, we introduce Causal3DIdent, a dataset of high-dimensional, visually complex images with rich causal dependencies, which we use to study the effect of data augmentations performed in practice.
    Estimating Gaussian Copulas with Missing Data. (arXiv:2201.05565v1 [stat.ML])
    (2 min) In this work we present a rigorous application of the Expectation Maximization algorithm to determine the marginal distributions and the dependence structure in a Gaussian copula model with missing data. We further show how to circumvent a priori assumptions on the marginals with semiparametric modelling. The joint distribution learned through this algorithm is considerably closer to the underlying distribution than existing methods.
    Combining No-regret and Q-learning. (arXiv:1910.03094v3 [cs.LG] UPDATED)
    (2 min) Counterfactual Regret Minimization (CFR) has found success in settings like poker which have both terminal states and perfect recall. We seek to understand how to relax these requirements. As a first step, we introduce a simple algorithm, local no-regret learning (LONR), which uses a Q-learning-like update rule to allow learning without terminal states or perfect recall. We prove its convergence for the basic case of MDPs (and limited extensions of them) and present empirical results showing that it achieves last iterate convergence in a number of settings, most notably NoSDE games, a class of Markov games specifically designed to be challenging to learn where no prior algorithm is known to achieve convergence to a stationary equilibrium even on average.
    Hypergraph reconstruction from network data. (arXiv:2008.04948v4 [cs.SI] UPDATED)
    (2 min) Networks can describe the structure of a wide variety of complex systems by specifying which pairs of entities in the system are connected. While such pairwise representations are flexible, they are not necessarily appropriate when the fundamental interactions involve more than two entities at the same time. Pairwise representations nonetheless remain ubiquitous, because higher-order interactions are often not recorded explicitly in network data. Here, we introduce a Bayesian approach to reconstruct latent higher-order interactions from ordinary pairwise network data. Our method is based on the principle of parsimony and only includes higher-order structures when there is sufficient statistical evidence for them. We demonstrate its applicability to a wide range of datasets, both synthetic and empirical.
    Parallel Neural Local Lossless Compression. (arXiv:2201.05213v1 [eess.IV])
    (2 min) The recently proposed Neural Local Lossless Compression (NeLLoC), which is based on a local autoregressive model, has achieved state-of-the-art (SOTA) out-of-distribution (OOD) generalization performance in the image compression task. In addition to the encouragement of OOD generalization, the local model also allows parallel inference in the decoding stage. In this paper, we propose a parallelization scheme for local autoregressive models. We discuss the practicalities of implementing this scheme, and provide experimental evidence of significant gains in compression runtime compared to the previous, non-parallel implementation.
    Bayesian sense of time in biological and artificial brains. (arXiv:2201.05464v1 [q-bio.NC])
    (2 min) Enquiries concerning the underlying mechanisms and the emergent properties of a biological brain have a long history of theoretical postulates and experimental findings. Today, the scientific community tends to converge to a single interpretation of the brain's cognitive underpinnings -- that it is a Bayesian inference machine. This contemporary view has naturally been a strong driving force in recent developments around computational and cognitive neurosciences. Of particular interest is the brain's ability to process the passage of time -- one of the fundamental dimensions of our experience. How can we explain empirical data on human time perception using the Bayesian brain hypothesis? Can we replicate human estimation biases using Bayesian models? What insights can the agent-based machine learning models provide for the study of this subject? In this chapter, we review some of the recent advancements in the field of time perception and discuss the role of Bayesian processing in the construction of temporal models.
    Machine Learning for Multi-Output Regression: When should a holistic multivariate approach be preferred over separate univariate ones?. (arXiv:2201.05340v1 [stat.ML])
    (2 min) Tree-based ensembles such as the Random Forest are modern classics among statistical learning methods. In particular, they are used for predicting univariate responses. In case of multiple outputs the question arises whether we separately fit univariate models or directly follow a multivariate approach. For the latter, several possibilities exist that are, e.g. based on modified splitting or stopping rules for multi-output regression. In this work we compare these methods in extensive simulations to help in answering the primary question when to use multivariate ensemble techniques.
    A novel notion of barycenter for probability distributions based on optimal weak mass transport. (arXiv:2102.13380v3 [stat.ML] UPDATED)
    (2 min) We introduce weak barycenters of a family of probability distributions, based on the recently developed notion of optimal weak transport of mass by Gozlanet al. (2017) and Backhoff-Veraguas et al. (2020). We provide a theoretical analysis of this object and discuss its interpretation in the light of convex ordering between probability measures. In particular, we show that, rather than averaging the input distributions in a geometric way (as the Wasserstein barycenter based on classic optimal transport does) weak barycenters extract common geometric information shared by all the input distributions, encoded as a latent random variable that underlies all of them. We also provide an iterative algorithm to compute a weak barycenter for a finite family of input distributions, and a stochastic algorithm that computes them for arbitrary populations of laws. The latter approach is particularly well suited for the streaming setting, i.e., when distributions are observed sequentially. The notion of weak barycenter and our approaches to compute it are illustrated on synthetic examples, validated on 2D real-world data and compared to standard Wasserstein barycenters.
    Path Regularization: A Convexity and Sparsity Inducing Regularization for Parallel ReLU Networks. (arXiv:2110.09548v3 [cs.LG] UPDATED)
    (2 min) Understanding the fundamental principles behind the success of deep neural networks is one of the most important open questions in the current literature. To this end, we study the training problem of deep neural networks and introduce an analytic approach to unveil hidden convexity in the optimization landscape. We consider a deep parallel ReLU network architecture, which also includes standard deep networks and ResNets as its special cases. We then show that pathwise regularized training problems can be represented as an exact convex optimization problem. We further prove that the equivalent convex problem is regularized via a group sparsity inducing norm. Thus, a path regularized parallel ReLU network can be viewed as a parsimonious convex model in high dimensions. More importantly, we show that the computational complexity required to globally optimize the equivalent convex problem is fully polynomial-time in feature dimension and number of samples. Therefore, we prove polynomial-time trainability of path regularized ReLU networks with global optimality guarantees. We also provide several numerical experiments corroborating our theory.
    Use of High Dimensional Modeling for automatic variables selection: the best path algorithm. (arXiv:2105.03173v2 [stat.ML] UPDATED)
    (2 min) This paper presents a new algorithm for automatic variables selection. In particular, using the Graphical Models properties it is possible to develop a method that can be used in the contest of large dataset. The advantage of this algorithm is that can be combined with different forecasting models. In this research we have used the OLS method and we have compared the result with the LASSO method.
    KrigHedge: Gaussian Process Surrogates for Delta Hedging. (arXiv:2010.08407v4 [q-fin.CP] UPDATED)
    (2 min) We investigate a machine learning approach to option Greeks approximation based on Gaussian process (GP) surrogates. The method takes in noisily observed option prices, fits a nonparametric input-output map and then analytically differentiates the latter to obtain the various price sensitivities. Our motivation is to compute Greeks in cases where direct computation is expensive, such as in local volatility models, or can only ever be done approximately. We provide a detailed analysis of numerous aspects of GP surrogates, including choice of kernel family, simulation design, choice of trend function and impact of noise. We further discuss the application to Delta hedging, including a new Lemma that relates quality of the Delta approximation to discrete-time hedging loss. Results are illustrated with two extensive case studies that consider estimation of Delta, Theta and Gamma and benchmark approximation quality and uncertainty quantification using a variety of statistical metrics. Among our key take-aways are the recommendation to use Matern kernels, the benefit of including virtual training points to capture boundary conditions, and the significant loss of fidelity when training on stock-path-based datasets.
    Recombinator-k-means: An evolutionary algorithm that exploits k-means++ for recombination. (arXiv:1905.00531v5 [cs.LG] UPDATED)
    (2 min) We introduce an evolutionary algorithm called recombinator-$k$-means for optimizing the highly non-convex kmeans problem. Its defining feature is that its crossover step involves all the members of the current generation, stochastically recombining them with a repurposed variant of the $k$-means++ seeding algorithm. The recombination also uses a reweighting mechanism that realizes a progressively sharper stochastic selection policy and ensures that the population eventually coalesces into a single solution. We compare this scheme with state-of-the-art alternative, a more standard genetic algorithm with deterministic pairwise-nearest-neighbor crossover and an elitist selection policy, of which we also provide an augmented and efficient implementation. Extensive tests on large and challenging datasets (both synthetic and real-word) show that for fixed population sizes recombinator-$k$-means is generally superior in terms of the optimization objective, at the cost of a more expensive crossover step. When adjusting the population sizes of the two algorithms to match their running times, we find that for short times the (augmented) pairwise-nearest-neighbor method is always superior, while at longer times recombinator-$k$-means will match it and, on the most difficult examples, take over. We conclude that the reweighted whole-population recombination is more costly, but generally better at escaping local minima. Moreover, it is algorithmically simpler and more general (it could be applied even to $k$-medians or $k$-medoids, for example). Our implementations are publicly available at \href{https://github.com/carlobaldassi/RecombinatorKMeans.jl}{https://github.com/carlobaldassi/RecombinatorKMeans.jl}.
    Eikonal depth: an optimal control approach to statistical depths. (arXiv:2201.05274v1 [math.ST])
    (2 min) Statistical depths provide a fundamental generalization of quantiles and medians to data in higher dimensions. This paper proposes a new type of globally defined statistical depth, based upon control theory and eikonal equations, which measures the smallest amount of probability density that has to be passed through in a path to points outside the support of the distribution: for example spatial infinity. This depth is easy to interpret and compute, expressively captures multi-modal behavior, and extends naturally to data that is non-Euclidean. We prove various properties of this depth, and provide discussion of computational considerations. In particular, we demonstrate that this notion of depth is robust under an aproximate isometrically constrained adversarial model, a property which is not enjoyed by the Tukey depth. Finally we give some illustrative examples in the context of two-dimensional mixture models and MNIST.
    A Deep Generative Model for Molecule Optimization via One Fragment Modification. (arXiv:2012.04231v5 [cs.LG] UPDATED)
    (2 min) Molecule optimization is a critical step in drug development to improve desired properties of drug candidates through chemical modification. We developed a novel deep generative model Modof over molecular graphs for molecule optimization. Modof modifies a given molecule through the prediction of a single site of disconnection at the molecule and the removal and/or addition of fragments at that site. A pipeline of multiple, identical Modof models is implemented into Modof-pipe to modify an input molecule at multiple disconnection sites. Here we show that Modof-pipe is able to retain major molecular scaffolds, allow controls over intermediate optimization steps and better constrain molecule similarities. Modof-pipe outperforms the state-of-the-art methods on benchmark datasets: without molecular similarity constraints, Modof-pipe achieves 81.2% improvement in octanol-water partition coefficient penalized by synthetic accessibility and ring size; and 51.2%, 25.6% and 9.2% improvement if the optimized molecules are at least 0.2, 0.4 and 0.6 similar to those before optimization, respectively. Modof-pipe is further enhanced into Modof-pipem to allow modifying one molecule to multiple optimized ones. Modof-pipem achieves additional performance improvement as at least 17.8% better than Modof-pipe.
    Leveraging Intrinsic Gradient Information for Further Training of Differentiable Machine Learning Models. (arXiv:2112.00094v2 [cs.LG] UPDATED)
    (2 min) Designing models that produce accurate predictions is the fundamental objective of machine learning (ML). This work presents methods demonstrating that when the derivatives of target variables (outputs) with respect to inputs can be extracted from processes of interest, e.g., neural networks (NN) based surrogate models, they can be leveraged to further improve the accuracy of differentiable ML models. This paper generalises the idea and provides practical methodologies that can be used to leverage gradient information (GI) across a variety of applications including: (1) Improving the performance of generative adversarial networks (GANs); (2) efficiently tuning NN model complexity; (3) regularising linear regressions. Numerical results show that GI can effective enhance ML models with existing datasets, demonstrating its value for a variety of applications.
    E(n) Equivariant Normalizing Flows. (arXiv:2105.09016v4 [cs.LG] UPDATED)
    (2 min) This paper introduces a generative model equivariant to Euclidean symmetries: E(n) Equivariant Normalizing Flows (E-NFs). To construct E-NFs, we take the discriminative E(n) graph neural networks and integrate them as a differential equation to obtain an invertible equivariant function: a continuous-time normalizing flow. We demonstrate that E-NFs considerably outperform baselines and existing methods from the literature on particle systems such as DW4 and LJ13, and on molecules from QM9 in terms of log-likelihood. To the best of our knowledge, this is the first flow that jointly generates molecule features and positions in 3D.
    $\ell_1$-norm constrained multi-block sparse canonical correlation analysis via proximal gradient descent. (arXiv:2201.05289v1 [stat.ME])
    (2 min) Multi-block CCA constructs linear relationships explaining coherent variations across multiple blocks of data. We view the multi-block CCA problem as finding leading generalized eigenvectors and propose to solve it via a proximal gradient descent algorithm with $\ell_1$ constraint for high dimensional data. In particular, we use a decaying sequence of constraints over proximal iterations, and show that the resulting estimate is rate-optimal under suitable assumptions. Although several previous works have demonstrated such optimality for the $\ell_0$ constrained problem using iterative approaches, the same level of theoretical understanding for the $\ell_1$ constrained formulation is still lacking. We also describe an easy-to-implement deflation procedure to estimate multiple eigenvectors sequentially. We compare our proposals to several existing methods whose implementations are available on R CRAN, and the proposed methods show competitive performances in both simulations and a real data example.
    Importance sampling based active learning for parametric seismic fragility curve estimation. (arXiv:2109.04323v2 [stat.ML] UPDATED)
    (2 min) The key elements of seismic probabilistic risk assessment studies are the fragility curves which express the probabilities of failure of structures conditional to a seismic intensity measure. A multitude of procedures is currently available to estimate these curves. For modeling-based approaches which may involve complex and expensive numerical models, the main challenge is to optimize the calls to the numerical codes to reduce the estimation costs. Adaptive techniques can be used for this purpose, but in doing so, taking into account the uncertainties of the estimates (via confidence intervals or ellipsoids related to the size of the samples used) is an arduous task because the samples are no longer independent and possibly not identically distributed. The main contribution of this work is to deal with this question in a mathematical and rigorous way. To this end, we propose and implement an active learning methodology based on adaptive importance sampling for parametric estimations of fragility curves. We prove some theoretical properties (consistency and asymptotic normality) for the estimator of interest. Moreover, we give a convergence criterion in order to use asymptotic confidence ellipsoids. Finally, the performances of the methodology are evaluated on analytical and industrial test cases of increasing complexity.
    A Kernel-Expanded Stochastic Neural Network. (arXiv:2201.05319v1 [stat.ML])
    (2 min) The deep neural network suffers from many fundamental issues in machine learning. For example, it often gets trapped into a local minimum in training, and its prediction uncertainty is hard to be assessed. To address these issues, we propose the so-called kernel-expanded stochastic neural network (K-StoNet) model, which incorporates support vector regression (SVR) as the first hidden layer and reformulates the neural network as a latent variable model. The former maps the input vector into an infinite dimensional feature space via a radial basis function (RBF) kernel, ensuring absence of local minima on its training loss surface. The latter breaks the high-dimensional nonconvex neural network training problem into a series of low-dimensional convex optimization problems, and enables its prediction uncertainty easily assessed. The K-StoNet can be easily trained using the imputation-regularized optimization (IRO) algorithm. Compared to traditional deep neural networks, K-StoNet possesses a theoretical guarantee to asymptotically converge to the global optimum and enables the prediction uncertainty easily assessed. The performances of the new model in training, prediction and uncertainty quantification are illustrated by simulated and real data examples.
    Data Fusion with Latent Map Gaussian Processes. (arXiv:2112.02206v2 [stat.ML] UPDATED)
    (2 min) Multi-fidelity modeling and calibration are data fusion tasks that ubiquitously arise in engineering design. In this paper, we introduce a novel approach based on latent-map Gaussian processes (LMGPs) that enables efficient and accurate data fusion. In our approach, we convert data fusion into a latent space learning problem where the relations among different data sources are automatically learned. This conversion endows our approach with attractive advantages such as increased accuracy, reduced costs, flexibility to jointly fuse any number of data sources, and ability to visualize correlations between data sources. This visualization allows the user to detect model form errors or determine the optimum strategy for high-fidelity emulation by fitting LMGP only to the subset of the data sources that are well-correlated. We also develop a new kernel function that enables LMGPs to not only build a probabilistic multi-fidelity surrogate but also estimate calibration parameters with high accuracy and consistency. The implementation and use of our approach are considerably simpler and less prone to numerical issues compared to existing technologies. We demonstrate the benefits of LMGP-based data fusion by comparing its performance against competing methods on a wide range of examples.
    Deep Learning for Agile Effort Estimation Have We Solved the Problem Yet?. (arXiv:2201.05401v1 [cs.SE])
    (2 min) In the last decade, several studies have proposed the use of automated techniques to estimate the effort of agile software development. In this paper we perform a close replication and extension of a seminal work proposing the use of Deep Learning for agile effort estimation (namely Deep-SE), which has set the state-of-the-art since. Specifically, we replicate three of the original research questions aiming at investigating the effectiveness of Deep-SE for both within-project and cross-project effort estimation. We benchmark Deep-SE against three baseline techniques (i.e., Random, Mean and Median effort prediction) and a previously proposed method to estimate agile software project development effort (dubbed TF/IDF-SE), as done in the original study. To this end, we use both the data from the original study and a new larger dataset of 31,960 issues, which we mined from 29 open-source projects. Using more data allows us to strengthen our confidence in the results and further mitigate the threat to the external validity of the study. We also extend the original study by investigating two additional research questions. One evaluates the accuracy of Deep-SE when the training set is augmented with issues from all other projects available in the repository at the time of estimation, and the other examines whether an expensive pre-training step used by the original Deep-SE, has any beneficial effect on its accuracy and convergence speed. The results of our replication show that Deep-SE outperforms the Median baseline estimator and TF/IDF-SE in only very few cases with statistical significance (8/42 and 9/32 cases, respectively), thus confounding previous findings on the efficacy of Deep-SE. The two additional RQs revealed that neither augmenting the training set nor pre-training Deep-SE play a role in improving its accuracy and convergence speed. ...
    A causal view on compositional data. (arXiv:2106.11234v2 [cs.LG] UPDATED)
    (2 min) Many scientific datasets are compositional in nature. Important examples include species abundances in ecology, rock compositions in geology, topic compositions in large-scale text corpora, and sequencing count data in molecular biology. Here, we provide a causal view on compositional data in an instrumental variable setting where the composition acts as the cause. First, we crisply articulate potential pitfalls for practitioners regarding the interpretation of compositional causes from the viewpoint of interventions and warn against attributing causal meaning to common summary statistics such as diversity indices. We then advocate for and develop multivariate methods using statistical data transformations and regression techniques that take the special structure of the compositional sample space into account. In a comparative analysis on synthetic and real data we show the advantages and limitations of our proposal. We posit that our framework provides a useful starting point and guidance for valid and informative cause-effect estimation in the context of compositional data.
  • Google AI Blog

    Introducing StylEx: A New Approach for Visual Explanation of Classifiers
    (8 min) Posted by Oran Lang and Inbar Mosseri, Software Engineers, Google Research Neural networks can perform certain tasks remarkably well, but understanding how they reach their decisions — e.g., identifying which signals in an image cause a model to determine it to be of one class and not another — is often a mystery. Explaining a neural model’s decision process may have high social impact in certain areas, such as analysis of medical images and autonomous driving, where human oversight is critical. These insights can also be helpful in guiding health care providers, revealing model biases, providing support for downstream decision makers, and even aiding scientific discovery. Previous approaches for visual explanations of classifiers, such as attention maps (e.g., Grad-CAM), highlight whic…
  • Data Science Central

    Internet of Robotic Things : Robotics and Intelligence Evolving to Make the New Era
    (3 min) Robot can be integrated as an entity in the Internet of Things (IoT) infrastructure thereby enabling connections between different entities using diverse communication protocols. Before we delve into the market, here’s a quick primer on the phenomenon – Internet of Robotic Things refers to robot connected as a thing to IoT technology which builds connections with other… Read More »Internet of Robotic Things : Robotics and Intelligence Evolving to Make the New Era The post Internet of Robotic Things : Robotics and Intelligence Evolving to Make the New Era appeared first on Data Science Central.
    5 Features Of Employee Scheduling Software For Your Team
    (4 min) Successful organizations rely on employee scheduling software features to simplify schedule changes, boost employee retention, and keep everyone running on the same page. In fact, over 60% of small businesses experience faster, more accurate, and easier scheduling using advanced software solutions. This software has become a valuable tool to plan hours of operations, job duties,… Read More »5 Features Of Employee Scheduling Software For Your Team The post 5 Features Of Employee Scheduling Software For Your Team appeared first on Data Science Central.
  • The Official NVIDIA Blog

    Billions Served: NVIDIA Merlin Helps Fuel Clicks for Online Giants
    (4 min) Online commerce has rocketed to trillions of dollars worldwide in the past decade, serving billions of consumers. Behind the scenes of this explosive growth in online sales is personalization driven by recommender engines. Recommenders make shopping deeply personalized. While searching for products on e-commerce sites, they find you. Or suggestions can just appear. This wildly Read article > The post Billions Served: NVIDIA Merlin Helps Fuel Clicks for Online Giants appeared first on The Official NVIDIA Blog.
  • MIT News - Artificial intelligence

    How well do explanation methods for machine-learning models work?
    (7 min) Researchers develop a way to test whether popular methods for understanding machine-learning models are working correctly.
  • AWS Machine Learning Blog

    Detect NLP data drift using custom Amazon SageMaker Model Monitor
    (11 min) Natural language understanding is applied in a wide range of use cases, from chatbots and virtual assistants, to machine translation and text summarization. To ensure that these applications are running at an expected level of performance, it’s important that data in the training and production environments is from the same distribution. When the data that […]
    Computer vision-based anomaly detection using Amazon Lookout for Vision and AWS Panorama
    (9 min) This is the second post in the two-part series on how Tyson Foods Inc., is using computer vision applications at the edge to automate industrial processes inside their meat processing plants. In Part 1, we discussed an inventory counting application at packaging lines built with Amazon SageMaker and AWS Panorama . In this post, we […]
  • Becoming Human: Artificial Intelligence Magazine - Medium

    How Unidentified Testing Can Help Find Bugs That Aren’t Immediately Visible?
    (3 min) Even when you believe you have everything perfect, non-functional testing might discover problems that were previously undetected.
    How the Future of the Recruitment Industry Is Going to Shape Out
    (3 min) The Recruitment Industry is going to witness a massive revolution in the coming decade.
    5 algorithms one must know to start with Quantum Computing.
    (5 min) “Life is strong and fragile. It’s a paradox… It’s both things, like quantum physics: It’s a particle and a wave at the same time. It all…
  • Artificial Intelligence

    Researchers Introduce High-Performance Deep Learning Toolbox for Genome-Scale Prediction of Protein Structure and Function
    (1 min) Gene functional annotation refers to correctly inferring a gene’s function from its sequence, a critical role in the biological sciences. The exponential expansion in the number of sequenced genomes has been fueled by dramatic breakthroughs in next-generation sequencing technologies, resulting in an increasing bottleneck of incorrect gene annotation. Computational techniques can help eliminate the gene-annotation bottleneck by taking advantage of this vast amount of data. Researchers from the Georgia Institute of Technology and the Department of Energy’s Oak Ridge National Laboratory are using supercomputing and new deep learning technologies to anticipate the structures and functions of thousands of proteins with unknown activities. The team used the summit supercomputer to accurately identify protein structures and activities across entire genomes of species. Their deep learning-based algorithms predict protein structure and function from DNA sequences, speeding up discoveries that could help enhance biotechnology, biosecurity, bioenergy, and pollution and climate change solutions. Continue Reading Paper: https://ieeexplore.ieee.org/document/9652872 submitted by /u/techsucker [link] [comments]
    How to Choose the Grid Size and Block Size for a CUDA Kernel?
    (1 min) submitted by /u/Just0by [link] [comments]
    Face Recognition During Coronavirus: Findings From the Experiments
    (0 min) With the arrival of the COVID-19 pandemic, the industry has started to question the effectiveness of face recognition on masked faces. We left the assumptions behind and conducted several experiments using CompreFace. https://exadel.com/news/face-recognition-during-coronavirus-do-masks-block-facial-recognition submitted by /u/lklimusheuskaja [link] [comments]
    Nice Explanation: Artificial Intelligence (AI) explained in 2 minutes
    (0 min) submitted by /u/Philo167 [link] [comments]
    How is AI Simplifying Software Development for Companies And Coders?
    (0 min) submitted by /u/noahthearc333 [link] [comments]
    My Sona Original Character Art
    (0 min) submitted by /u/VIRUS-AOTOXIN [link] [comments]
    Weekly China AI: Top 10 Tech Trends for 2022; How to Make Composite Image Look Authentic? Add Shadows; New Chinese Benchmark Tests NLU & NLG Models
    (1 min) submitted by /u/trcytony [link] [comments]
    Who was the first AI villain conceived?
    (3 min) For some reason I'm obsessed with AI villains. The train of thought the writers have to ride to come up with them seems to put them on a flow that creates revolutionary pieces of art, be it Ghost in the shell (inspired the Matrix!), System Shock, or 2001: A Space odyssey, to name a few. I'm curious, who would be the first AI villain conceived? I tried googling it, but to no avail. submitted by /u/Zunge [link] [comments]
    Building an Expert System
    (1 min) Seeking views on whether the GPT family is sufficiently mature to be used as the core components of an expert system (knowledge base plus inference engine), or would it be easier to go with the conventional knowledge graph and rule based inference engine? Thanks! submitted by /u/minisoo [link] [comments]
    Artificial Nightmares: Chesire Cat Fields || Pytti VQGAN AI Art Video [4K 60 FPS]
    (1 min) submitted by /u/Thenamessd [link] [comments]
  • Machine Learning

    [N] Competition on Meta-learning from Learning Curves (Prizes: 1,000$) - 3 weeks left
    (1 min) ​ https://metalearning.chalearn.org/ Link to our competition: https://codalab.lisn.upsaclay.fr/competitions/753 Prizes: 1,000$ Introduction video: https://youtu.be/1rLc9Z0vRzw Medium article: https://medium.com/@hungnm.vnu/meta-learning-from-learning-curves-ieee-wcci-2022-competition-5e1932742644 Artificial learning systems are good at learning to self-drive cars, to assist doctors to diagnose disease, to play video games, to recognize faces, to translate languages, but do poorly across many tasks. Recently, more effort has been put to recycle knowledge acquired by learning a task (e.g. recognizing objects), to learn faster and better a new task (e.g. recognizing faces), typically re-using or fine-tuning learned representations. BUT IS IT POSSIBLE TO LEARN-TO-LEARN? Namely develo…
    [D] Google is killing the Data Scientist
    (1 min) So I've noticed that AutoML is reducing the need for Data Scientists that build models. Shoving data into models can be done by anywho who knows SQL (data analyst, data engineer) and AutoML is MASSIVELY out-performing teams of data scientists. Do you think data science is overhyped right now? Do you think there's a need to build a train models when AutoML can do it better and cheaper? See: https://ilovedata.medium.com/how-google-is-killing-the-data-scientist-94e641c98e2e submitted by /u/Flat_Shower [link] [comments]
    [D] Do the advances in machine learning mean the dawn of theory-free science?
    (2 min) An interesting paper by Laura Spinney in the Guardian about the advent of post-theory or theory-free science, based only on empirical data and machine learning, particularly deep learning. She revisits the famous Chris Anderson's paper of 2008: "The end of theory" published in Wired. submitted by /u/ClaudeCoulombe [link] [comments]
    [D] Mathematics research towards machine learning
    (1 min) Hello everyone. I'm a soon to graduate undergrad student, and recently been thinking about pursuing a master's degree. I'm really interested in formal and advanced mathematics, particularly in probabilit theory, probability optimization and stochastic calculus. After my master's I still plan on working with ML for a while and wanted to know if these research areas are somewhat useful in machine learning and if I could have a research topic that i'm really interested in. submitted by /u/Bira-of-louders [link] [comments]
    [P] FFCV: Accelerated Model Training via Fast Data Loading
    (2 min) Hi r/MachineLearning! Today we released FFCV (ffcv.io), a library that speeds up machine learning model training by accelerating the data loading and processing pipeline. With FFCV we were able to: Train an ImageNet model (ResNet-18) to 66.9% accuracy *on one GPU* in 35 minutes (98¢/model on AWS) Train an ImageNet model (ResNet-50) to 75.6% accuracy *on one AWS machine* in 20 minutes ($4.5/model on AWS) Train a CIFAR-10 model on one GPU in 36 seconds (2¢/model on AWS) The best part about FFCV is how easy it is to use: chances are you can make your training code significantly faster by converting your dataset to FFCV format and changing just a few lines of code! To illustrate just how easy it is, we also made a minimal ImageNet example that gets high accuracies at SOTA speeds: https://github.com/libffcv/ffcv-imagenet. Let us know what you think! submitted by /u/andrew_ilyas [link] [comments]
    [P] Speech Separation Algorithm
    (1 min) I'm working on using an audio wav file as input that would have 2 to 3 people conversing. I have to identify the two different voices. Separate 2 voices. Convert these voices into text and have their transcripts stored. I have gone through huggingface transformers such a web2vec2 and svoice from Facebook research but I find It difficult to implement the models. Can someone guide me on approaching such tasks as I am being a beginner in audio domain of deep learning. submitted by /u/Yagna24 [link] [comments]
    [P] How do we account for ‘confounding due to indication’ while training a model?
    (2 min) Hello community, I have been a newbie to machine learning and recently got involved with building models in health outcomes research. We have a risk prediction and classification model that we trained initially using logistics regression. We found that a very important variable is significantly positively associated with outcome but in reality it should be negatively associated. We then figured the reason as it was a confounded by indication. We than eliminated the variable and then employed lasso, build a generalized boosted and random forest models. My concern is, given that this is one of the most important variable for our model from usability pov I am not very certain of dropping it away. Is there a way we can account for this confounding effect and include the variable in other algorithms? This is a retrospective big data that we are working with. Any suggestions or direction to any literature is helpful. Thank you! submitted by /u/HealthtoML [link] [comments]
    [R] Instance-segmentation combined with regression per instance
    (1 min) Hi there. For a project I'm looking into adding a regression head to a Mask R-CNN model. The aim is to predict a continuous variable for each of the detected instances. An example could be to predict the distance to the camera for each of the detected pedestrians. Or the age, weight or height. I wonder if there is any research into this. And what the name of the problem statement or related field of research is. I've already did a google scholar search, and read through all of the paperswithcode methods but could not find anything remotely like this. Most of the results relate to the regression of the bounding boxes, and not to a regression output variable. Keywords to search for, suggested readings, research groups, or anything that touches on this subject is much appreciated. submitted by /u/pietermarsman [link] [comments]
    "None of above" problem (Image Classification) [D]
    (1 min) Hello, I would like to know how do you handle the "None of above" problem in multi-class image classification. (None of above problem: If the image does not match any of the classes ) Thank you submitted by /u/CommercialWest7683 [link] [comments]
    [P] Choosing a self-supervised learning framework that's easy to use
    (1 min) I need to pretrain a spatial image classification model using self-supervised learning and would like to find the best/easiest to use implementation of some SOTA. There are a couple of factors: Tensorflow is preferred, but if there are great implementations in Pytorch, that is also fine Training on multiple GPUs needs to be supported Also, these likely don't impact the framework selection, but worth noting: ResNet will be used as the (initial) backbone The dataset is huge, 1B images can be used if necessary (if dataset size is a factor when choosing the implementation) Do you have any recommendations/links to code that can be used? Frameworks I'm initially considering are having issues: SimCLR -can be trained only on one GPU. Any known workarounds for that? BYOL - again, it seems that it's not optimized for running on multiple GPUs. submitted by /u/alkibijad [link] [comments]
    [D] Visualizing attention
    (0 min) How do you visualize attention especially when there are many tokens (256,512) with multiple layers and multiple heads? Most visualizations and frameworks I’ve tried fail when there are more than 100 tokens. This is quite important since most nlp and vision models now use multi head attention. submitted by /u/mlvpj [link] [comments]
    [N] We just got funded for an open-source project to make Metric Learning practical.
    (1 min) Hello all 👋 We are developing Qdrant, an open-source neural search engine written in Rust https://github.com/qdrant/qdrant. We are happy to announce that we have just closed our initial funding round to build our technology further and make Metric Learning practical. 🚀 Reddit and the community played an important role. Thank you! 🙏 Our vector similarity engine provides a production-ready service with a convenient API to store, search, and manage vectors (embeddings) along with the additional payload. Qdrant is tailored to extended filtering support, making it useful for all sorts of neural-network or semantic-based matching, recommendations, faceted search, and other applications. Feedback welcome! submitted by /u/devzaya [link] [comments]
    [D] I am working on streamlining computer vision data annotation and Nvidia's Transfer Learning Toolkit (TLT)
    (1 min) I am working on a project to integrate a data annotation and management tool and TLT and make a UI for it. Currently I am in brainstorming phase. I would like to know is there any such solutions out there which I can refer as a benchmark. If not is there any thoughts on how to develop such tool. Any inputs are appreciated. submitted by /u/Ahmad401 [link] [comments]
    [P] I trained every single SOTA from 2021 and accidentally got a silver medal on Kaggle
    (3 min) ![](https://i.ibb.co/gwpJXBm/lb.png) I trained every single SOTA model from 2021 and accidentally got a silver medal on an image classification competition on Kaggle recently (Pawpularity Contest). Here If you are interested The idea was to train every SOTA and then Nuke the leaderboard with 10 Billion parameters ensemble of ensembles. Some ensembles were also supplemented a bit with catboost 2nd stage model just for the "why not". Outline of the approach: https://i.ibb.co/McJ39mW/image-nuke.png This stunt was done mainly for the purpose me catching up with the current most recent SOTA vision papers. I seriously didn't try to compete on the leaderboard and never had the intention of releasing a public notebook that actually gets a silver medal. This came as a complete surprise to me! Hope the solution will be useful for many others in the future. If you got any questions or feedback, I'll be more than happy to discuss them! submitted by /u/yamqwe [link] [comments]
    Any ML engineers here that have degrees from unrelated fields? [D]
    (5 min) Been going back and forth on mastering out from my PhD program in mechanical engineering. Research mostly involves ML though. I want to be an ML engineer or ML research scientist (not sure yet), and I don’t know whether it’d be better to continue my PhD for another 2-3 years or whether to master out after this summer and then start working in an ML related role. Although, I’d probably have to take some months first to self-study some more / do more personal projects to add to my portfolio. So I’m wondering if anyone in this sub got a PhD in an unrelated field & managed to land an ML role that pays well. Would love to hear your story. Thanks! submitted by /u/Careless_Drummer_208 [link] [comments]
    [Research] Regression problems where the input shape may change.
    (2 min) Specifically I'm working on an applied ML problem where I need to create a surrogate model for optimization of just one feature. Imagine my input data is a 2D physical plane with a vector field superimposed. If the vector field is highly divergent then I get a big output number, Omega. I care more that my predicted Omega has a high R Squared value than absolute error. Now I've been able to get a CNN to work for a simple 2D grid with reasonable results. But I need my model to be able to handle the case where the input shape isn't always a simple 2D plane. The model needs to be able to handle holes in the physical layer the vector field is super imposed on. As in some times there will and won't be holes, and the holes can move. I've briefly looked into Spatial Transformer Networks but its mostly centered on Computer Vision. So much research in this area is for computer vision, and I am playing Research Paper Hop-Scotch trying to find something to help me. Does anyone has any recommendation of current research? Papers or key words to search and find. Thank you in advance. Seriously though I wish there was an ML community that focused on everything ML but Computer Vision. submitted by /u/Slowbutstrong [link] [comments]
  • Neural Networks, Deep Learning and Machine Learning

    Feedforward delay of deep neural networks
    (1 min) I'm wondering if exciting a DNN (ie. feed-forwarding the values of each layer) would ever be a problem on a hardware-limited platform like a small drone. For a time-sensitive application like using a DNN as a controller, how large would the network have to be for the feed-forward process to impose considerable issues in terms of time delay? Have there been any studies exploring this concept? submitted by /u/Cogitarius [link] [comments]
    Looking for a Cyclist Detector or a Database containing images of cyclists.
    (1 min) TLDR; Need help finding a Cyclist Detector or a Database containing images of cyclists. Hello! I have a camera set up looking out a second-story window at the street and would like to count the number of cyclists passing by. I'm using the OpenVINO tech stack to do the detection, but their 'person-vehicle-bike-detection-crossroad-1016' has atrocious accuracy for cyclists. (For reference: https://docs.openvino.ai/2021.2/omz_models_intel_person_vehicle_bike_detection_crossroad_1016_description_person_vehicle_bike_detection_crossroad_1016.html ) I'm asking for a little help in case someone has knowledge on the subject. (My google-fu has been unable to find what I need on this occasion.) If anyone would kindly provide information on a cyclist detection algorithm in Caffe, TensorFlow, Kaldi, or MXNet; I could use the OpenVINO model optimizer to turn them into Intermediate Representations and get it working with my setup. Likewise, the longer route would be to train my own model. However, as many of you would know, getting a proper dataset is the hardest step to get started. If anyone knows an appropriate data set I would be happy to take a look to see if it's suitable. I appreciate your time and suggestions on this matter. submitted by /u/Naviance_Tartico [link] [comments]
  • Reinforcement Learning

    Is optimal action sequence problem in a finite stochastic MDP undecidable?
    (1 min) We know that an optimal policy can be solved for a finite MDP even if it's stochastic. But what if the question is, given a state s_t, what is the optimal action sequence of length k, i.e., {a_t, ..., a_t+k-1}? We know a_t from the optimal policy, but the best we can know about s_t+1 is a probability distribution, which makes it essentially a belief state in a POMDP. Since planning with belief states in POMDPs are generally undecidable, then optimal action sequence problem in a finite stochastic MDP is also undecidable. Or am I wrong? submitted by /u/ritiange [link] [comments]
    What are the best algorithms for team games (3v3, 4v4, 5v5)?
    (1 min) I want to train bots for some team game where each team will have 3 to 5 teammates. Enviroment is partially observable. What are the best algorithms for that kind of an enviroment? submitted by /u/TheGuy839 [link] [comments]
    Applications of RL in instructional sequencing?
    (1 min) Hi all! I'm new here (both to reddit and to the RL world) and very happy to join the community! I am interested in developing algorithms for adaptive learning in education (by adaptive learning I mean algorithms that help students define their own learning path through some sort of educational platform), and I'd like to know if any of you have heard about using RL to that end. I've read some sources where they mention the use of MDPs and POMDPs for instructional sequencing (check this one for instance), but I'm not sure if this subarea has developed any further since. The reason why I think RL might be interesting to me is that eventually, I'd like to work in an algorithm that delivers a collaborative instructional sequence for a group of students. That is, given a bunch of students with a common goal (e.g. doing some teamwork), output an optimal sequence of concepts to study and exercises to solve, so that each one of them passes the subject and such that the group benefits from the individual skills as much as possible. If I base my adaptive learning algorithms on RL, then I could extend it with these collaborative features quite naturally using MARL... I guess that my question here is: does any of this make any sense to you? xd P.S: we are talking about a study and development period of 3-4 year submitted by /u/aloaberasturi [link] [comments]
    Can anyone explain how this happened ? (linear function approximation)
    (0 min) ​ https://preview.redd.it/vlm36ckqbhc81.png?width=940&format=png&auto=webp&s=06ee88e19790f7dc96438e63d034f8b18fe2f045 submitted by /u/Whole_Run_4485 [link] [comments]
    Do softmax values of policy gradient keep the order of values for each action
    (1 min) Assume I train a softmax policy with policy gradient. At test time, I input one state and I get the softmax values for the n actions z_1, z_2, z_3, ... z_n with, for example, z_1 > z_2 > z_3 ... z_n. Can I expect then that the Q values have the same order, i.e. Q(s, a_1) > Q(s, a_2) > ... > Q(s, a_n)? submitted by /u/fedetask [link] [comments]
    Advice needed on RL problem definition. (stochastic, time-varying environment)
    (1 min) Hi everyone! Recently, I have been working on a problem that I want to solve with reinforcement learning. However, I have some doubts about what would be the right way of defining the problem in an RL framework. This is what I have so far: The actions space is quite simple; the agent has a choice of 4 possible (fixed) actions. The state space and environment is where I get a bit stuck. The environment is stochastic and time-varying. A realization of this environment is available as a state. So what is observed of the environment is given as a state. (call this part of the state; state_1) Since the environment is stochastic, the state transitions are stochastic. Also, due to the time-variability, the shape of the distribution changes with time. Another factor is added to the state definition as well (state_2). This factor is actually a (deterministic) function of the state and chosen action. So it is directly related to the chosen action in a state. Furthermore, it is related to the previous value of state_2. As a reward, a factor based on the performance of the action given the state is defined. I think this would be hard for the agent to solve, since it only sees realizations of the environment. Another thought I have is to change the definition of state_1 to the environment distribution parameters of the current time. That way, the agent can make decisions based on the estimated distribution, instead of a realization of it. However, in that case I have to find a method to estimate the distribution. Do you guys have any thoughts on how to solve such a problem? And what algorithm to use? Any advice is welcome! Thanks! submitted by /u/plopthegnome [link] [comments]
    Help with reward function
    (2 min) Hello :D, I have a problem where an agent needs to learn how to reach a certain target location (environment is a grid world). The agent has 5 possible actions, to move (up, down, right, left) or to stay idle. My reward function is as follows: moving has a cost of -1, while staying idle has a cost of 0. If the agent gets closer to the target, the reward is incremented by 1, and decremented by 1 otherwise. Basically, if the agent moves towards the target, the reward for that steps would be 1-1=0, if the agent moves away the reward would be -1-1 = -2, and if the agent stays idle the reward would be 0-1=-1 (-1 for not getting closer to the target). This works perfectly when the agent moves at a speed I started with. However, when I reduce the speed, the agent never learns to finish the task (actually the agent for some reason decides to stay idle all the time). What could be the reason for this? Decreasing the speed generally means that the agent needs more time steps to finish the task. How could this change result in this problem, when the original speed worked just fine. submitted by /u/AhmedNizam_ [link] [comments]
    "The Rise of A.I. Fighter Pilots: Artificial intelligence is being taught to fly warplanes. Can the technology be trusted?"
    (1 min) submitted by /u/gwern [link] [comments]
    Latest CMU Research Improves Reinforcement Learning With Lookahead Policy: Learning Off-Policy with Online Planning
    (1 min) Reinforcement learning (RL) is a technique that allows artificial agents to learn new tasks by interacting with their surroundings. Because of their capacity to use previously acquired data and incorporate input from several sources, off-policy approaches have lately seen a lot of success in RL for effectively learning behaviors in applications like robotics. What is the mechanism of off-policy reinforcement learning? A parameterized actor and a value function are generally used in a model-free off-policy reinforcement learning approach (see Figure 2). The transitions are recorded in the replay buffer as the actor interacts with the environment. The value function is updated by maximizing the action values at the stages visited in the replay buffer. The actor is trained using the transitions from the replay buffer to forecast the cumulative return of the actor. Continue Reading Paper: https://arxiv.org/pdf/2008.10066.pdf Project: https://hari-sikchi.github.io/loop/ Github: https://github.com/hari-sikchi/LOOP CMU Blog: https://blog.ml.cmu.edu/2022/01/07/loop/ submitted by /u/techsucker [link] [comments]

2022-01-17

  • Data Science Central

    Why Microsoft Power BI Is Among The Most Popular Business Analytics Tools
    (5 min) Nowadays, organizations of varying sizes and types have to use huge data sets for sustenance and revenue growth, and this is a global trend. However, for these businesses, gathering, analyzing, and deciphering huge volumes of data is not a breeze. They have to use specialized software applications to deal with massive amounts of data per… Read More »Why Microsoft Power BI Is Among The Most Popular Business Analytics Tools The post Why Microsoft Power BI Is Among The Most Popular Business Analytics Tools appeared first on Data Science Central.
    CDO Challenge: Providing Clear “Line of Sight” from Data to Value
    (5 min) The end of year always leads to lists of the most impactful technologies from the previous year and projections as to the most impactful technologies for the next year.  These are entertaining debates about topics such data meshes vs data lakes vs data fabrics vs data warehouses vs data lakehouses vs data department stores vs… Read More »CDO Challenge: Providing Clear “Line of Sight” from Data to Value The post CDO Challenge: Providing Clear “Line of Sight” from Data to Value appeared first on Data Science Central.
    Top 12 Software Development Trends in 2022 You Should Watch For
    (5 min) There are constant changes in the software development trends, but a few trends seem to be dominant in 2022. With the evolution of advanced technology, there has been a significant change in the software development landscape. Businesses need to keep up with these changes in order to compete in the next-gen world.  To help you… Read More »Top 12 Software Development Trends in 2022 You Should Watch For The post Top 12 Software Development Trends in 2022 You Should Watch For appeared first on Data Science Central.
    Notable enterprise data trendsetters, 2002 – 2022
    (5 min) As a former tech trends forecaster, I’m always pondering what we could achieve and how a better understanding of history could help us navigate the decades in front of us.  Companies large and small have overcome major data handling obstacles over the past 20 years. It may not seem so, but we’ve gone a long… Read More »Notable enterprise data trendsetters, 2002 – 2022 The post Notable enterprise data trendsetters, 2002 – 2022 appeared first on Data Science Central.
    An Easy Way to Understand the Back-Propagation Algorithm
    (3 min) The back-propagation algorithm is one of the foundations of neural networks and deep learning. It is also hard to understand for beginners. I found a resource, a free chapter of a book by Andrew Glassner, link below that explains backpropagation using a unique and an easy to understand approach Essentially, we start with a contrived… Read More »An Easy Way to Understand the Back-Propagation Algorithm The post An Easy Way to Understand the Back-Propagation Algorithm appeared first on Data Science Central.
  • Artificial Intelligence

    I think it might not be a bad idea to think of the corporation itself as a form of AI
    (1 min) While people are certainly involved deeply with the way a corporation functions. It often seems to display a will of its own. There is an inexorable logic to the way businesses are run, and that is ignoring the potential for bad individual actors. Legally they are in many ways first class citizens, because if they murder people they aren't held meaningfully accountable even when they are caught. I think the AI is programmed in places like business schools. That will teach you how to run a successful business, but also will instill in you a certain worldview. A worldview that has very real consequences since it is misaligned with reality. It ignores externalities and tries to make us pay the cost for their harms. I get that this may be a little far out for some, but I do not say these things lightly. All of the tech companies are starting to show their character, and predictably they are harming the world. submitted by /u/Memetic1 [link] [comments]
    Last Week in AI - AI early warning system for new Covid-19 variants, increased federal spending on facial recognition, AI used to detect fake art, and more!
    (1 min) submitted by /u/regalalgorithm [link] [comments]
    Text to AI generated artwork
    (1 min) One page I discovered recently will take songs and have AI generate artwork for the lyrics. I love this idea, but am very green in AI. What ways could I do this? I would like to take books and poems and have AI generate artwork for them based on just the words. I would also like to be able to have these maybe on a canvas to have in my house. What would be the best way to go about it? Edit: I see the Wombo Dream app has a tool that does exactly this, but only allows for 100 characters, and I want to make something that is my own. submitted by /u/ImpeccableSloth33 [link] [comments]
    Artificial Nightmares : Ronald McDonald's Last Supper || Pytti AI Art Video [4K 60 FPS]
    (0 min) submitted by /u/Thenamessd [link] [comments]
    Creating Synthetic Time Series Data for Global Financial Institutions – a POC Deep Dive
    (0 min) https://gretel.ai/blog/creating-synthetic-time-series-data-for-global-financial-institutions-a-poc-deep-dive submitted by /u/Repeat-or [link] [comments]
    For those of you AI lovers, I have created an AI tool that estimates a price for your Udemy course based on some keywords. If you are interested in how I made such a tool, I wrote an article about it. Freel free to check it out. I hope it inspires you! ( :
    (1 min) submitted by /u/amirdol7 [link] [comments]
    What is voyage?
    (0 min) submitted by /u/roblox22y [link] [comments]
    [R] Pushing the Limits of Self-Supervised ResNets: DeepMind’s ReLICv2 Beats Strong Supervised Baselines on ImageNet
    (1 min) A DeepMind research team proposes ReLICv2, which demonstrates for the first time that representations learned without labels can consistently outperform a strong, supervised baseline on ImageNet and even achieve comparable results to state-of-the-art self-supervised vision transformers (ViTs). Here is a quick read: Pushing the Limits of Self-Supervised ResNets: DeepMind’s ReLICv2 Beats Strong Supervised Baselines on ImageNet. The paper Pushing the Limits of Self-supervised ResNets: Can We Outperform Supervised Learning Without Labels on ImageNet? is on arXiv. submitted by /u/Yuqing7 [link] [comments]
    Bill Gates & Yuval Noah Harari Discuss: What Happens When AI Knows Humans Better Than They Know Themselves? (short audio clip)
    (0 min) submitted by /u/frog9913 [link] [comments]
    Machine learning is only useful if it is based on reliable data.
    (2 min) For engineers interested in the power of artificial intelligence, the application of machine learning to individual health should be the primary focus. AI has the potential to provide higher-quality treatment to a larger proportion of the world's population at a cheaper cost. Drugs that are linked to the genetic code, as well as million-dollar imaging and diagnostic devices, are out of reach for many of the world's poorest people. In order to deliver on the promise of technology, sciencemen must turn to raw materials that are affordable to the vast majority of the world's population. Machine learning is sensitive to the degree of similarity across people; the physician can learn what works for one patient and then tailor suggestions to the unique qualities of another. The revolutionary promise of AI is that health-care systems no longer have to choose between personalization and scalability. What could possibly go wrong? A lot of it. Bias in medical machine learning may be lethal. The most readily available data for training healthcare models does not reflect the worldwide illness load. Similarly, volunteers in clinical studies do not fully reflect the range of patients. Bias in machine learning is frequently caused by training an algorithm using data that does not accurately reflect reality. The machine's world is defined by what you display it. When the training data is realigned, the algorithm learns to rectify previous tendencies. AI shows health systems that their patients are not a block but a group of individuals with different needs and risks. It is counter-intuitive but if done right, it could reinforce the humanity of medicine by emphasising what a provider notices about a patient, making the interaction between patient and provider ever more central. We have to be ready for the changes that AI can bring to medicine. It is necessary for now because of the huge overload caused by the crisis.. submitted by /u/FrothyDigger [link] [comments]
    TinyML is bringing neural networks to small microcontrollers
    (0 min) submitted by /u/bendee983 [link] [comments]
    Approaching the Edge of the Universe: ‘Handle Nebula’ [ARTBREEDER]
    (0 min) submitted by /u/justafriendlymouse [link] [comments]
  • Machine Learning

    [Discussion]Graph network to embedded vector
    (1 min) Hello, I am looking for methods to turn a graph network of variable size into a fixed length vector for input into a NN. I have already found methods suck as graph2vec but am looking for alternative methods as well due to implementation difficulty. This is assuming the graph is already a list of embedded nodes(from deep walk node2vec or any other node embedding method). Inductive embedding methods are preferred due to the shortcomings of transductive techniques. If anyone can provide tips or sources it would be greatly appreciated! submitted by /u/Upstairs-Show-5962 [link] [comments]
    [D] A Science Journalist’s Journey to Understand AI (The Gradient)
    (1 min) Hi there, just want to share the latest Gradient article A Science Journalist’s Journey to Understand AI. Might be an interesting read to get an outsider's perspective on AI, and be reminded of the misconceptions/views non-technical folks have about it. Though some of these things are also quite specific to the author's point of view. In any case, I think it's worth a read. submitted by /u/regalalgorithm [link] [comments]
    [P] UltimateAdam for Tensorflow - transition from SGD to Adam
    (2 min) Hi all. I've been playing around with a new optimizer that I've had some good success with. We know now that one of the main problems with Adam is that variance of adaptive learn rates in the early steps of optimization cause a sub-optimal trajectory, which ultimately leads to converging to a sub-optimal final minima. As it turns out this is the main reason why many people say that SGD is the golden standard for final performance. There have been studies recently that show when Adam is tuned correctly and the variance issues are dealt with, Adam can actually achieve even better final performance than SGD in many cases. Rectified Adam is one way that we try to deal with this variance issue by warming up the learn rate in the early steps, so that the trajectory is more stable and better qu…
    [D] Federated Learning Libraries in 2022?
    (1 min) I'm starting a new experiment on federated learning in Python and was looking to see what libraries were available. It seems like PySyft was the most popular, but there seems to be multiple new options now (IBM, Flower, TensorFlow, FedML). Does anyone have experience to know which of these would be the best to use? submitted by /u/JustARoom [link] [comments]
    [P] How we generate high-quality synthetic time-series data for one of the largest financial institutions in the world.
    (1 min) In this project, we discuss the creation of high-quality synthetic time-series datasets, and the methods we designed to assess the accuracy and privacy of our models and data. Code and examples available with the blog. https://gretel.ai/blog/creating-synthetic-time-series-data-for-global-financial-institutions-a-poc-deep-dive submitted by /u/alig80 [link] [comments]
    [D] NLP: Knowledge based vs neural based
    (0 min) I have been working in nlp with neural language models (W2V, transformers, etc...) for a while and wanted to expand my skill set to the knowledge based approaches. Are knowledge based approaches used in industry and how can they complement neural based approaches? Any book/learning material reference is highly appreciated, thanks! submitted by /u/eatpasta_runfastah [link] [comments]
    [D] AISTATS 2022 Paper Acceptance Result
    (0 min) AISTATS 2022 paper acceptance results are supposed to be released tomorrow. Creating a discussion thread for this year's results. submitted by /u/zy415 [link] [comments]
    [Discussion] How do you organize a data science team process?
    (2 min) Working in R&D can sometimes be a pain in the a*, especially when designing a process that would fit the whole team. I've heard about and have worked on projects where there were no daily meetings, no discussion about technology across the team, and it was kind of free-form till the due date. Enforcing rules on data scientists often backfires with their productivity too. How do you tackle such a problem in strictly research & development areas? submitted by /u/InGraphsWeTrust [link] [comments]
    [R] CoAtNet: Marrying Convolution and Attention for All Data Sizes (video explanation)
    (0 min) Here is an youtube video explaining the best Image Classification model available today namely CoAtNet. Hope its useful: https://youtu.be/VoRQiKQcdcI submitted by /u/Combination-Fun [link] [comments]
    How useful is knowledge of parallel programming in ML? [D]
    (3 min) I'm an aspiring ML engineer. Wondering how useful or applicable is the knowledge of parallel computation in the world of AI/ML? submitted by /u/mha_1992 [link] [comments]
    [D] Joshua Tenenbaum's opinions on Neural Generative Models
    (2 min) TL;DR Hail Mary Request: Does anyone know what Josh thinks of VAE's, GANs and other Neural Generative Models, or maybe have a link to a source in which he describes his opinions on them? Hey all, I've recently been doing some literature review on Probabilistic Programming and have been tearing through Josh Tenenbaum's bibliography and watching several presentations of his. During this, I (maybe) came across a presentation on YouTube in which he mentioned his opinion of the usefulness of VAE's and GANs as generative models. Something along the lines of "They're a step in the right direction (Generative Modelling) but don't work as GM's nearly as well as probabilistic programs". Unfortunately I've totally lost track of the source of this throwaway insight and can't even be sure if it wasn't from a paper, from someone other Probabilistic Programming researcher, or it was an opinion of his I imagined that just happens to fit my model of him (Pun Intended). I've already spent the morning trying to track down the source of this supposed opinion of his so I decided to just try asking this subreddit on the off chance anyone knows what he thinks and especially where he shared it. Thanks for your time, A Certified Money License Issuer submitted by /u/MoneyLicense [link] [comments]
    [P] Cherche - allows you to create a neural search pipeline using retrievers and pre-trained language models as rankers.
    (0 min) submitted by /u/binaryfor [link] [comments]
    [P] Open-source library to process unstructured data for ML tasks
    (1 min) Excited to share Docarray, an open-source python library to store and process unstructured data such as text, image, audio, video, or 3D mesh. Useful in processing data for ML tasks such as embed, search, recommend, etc. DocArray aims to be the data structure for unstructured data DocArray consists of two simple concepts: Document: a data structure for easily representing nested, unstructured data 2 DocumentArray: a container for efficiently accessing, manipulating, and understanding multiple Documents Why did I build it? While working on Jina(an AI powered Search framework), I needed a way to store and process the large amounts of unstructured data for the purpose of creating embedding and build search on top of that. I tried solutions such as json, numpy.ndarray, pandas.DataFrame, Protobuf, etc. But they were not suitable for our computation intensive tasks for unstructured and nested data. Ask me question if you need more info on this. Checkout GitHub repository for examples. This project has been used and tested well in my other project(Jina), but there's a lot of scope of making it more useful for the community. Ask me your questions and share your suggestions submitted by /u/opensourcecolumbus [link] [comments]
  • Reinforcement Learning

    How does OpenAI PPO Parallelism work?
    (1 min) I've already seen this post and this one. But it is still unclear to me some of the details. In particular I'm wondering how this parallelism differs from 'batched environment' parallelism? Is at all unique to PPO? Also how scalable is this? It seems like they are just increasing the batch size enormously (to like 1m examples), does this ever face diminishing returns? submitted by /u/Udon_noodles [link] [comments]
    Selecting agent actions (Beginner Question)
    (1 min) I'd like to think I have a basic amount of high level understanding of RL when asking this question. What would be normal when writing an agent selecting an action, ensuring the agent selects from any valid actions, or letting the action select any action and just penalizing it if the action is illegal and retaining the state. submitted by /u/DisastrousTrainer322 [link] [comments]
    "Dataset Cartography: Mapping and Diagnosing Datasets with Training Dynamics", Swayamdipta et al 2020
    (1 min) submitted by /u/gwern [link] [comments]

2022-01-16

  • Data Science Central

    The Evolution of Smart City Tourism
    (4 min) Smart city tourism uses big data and AI to provide a range of tourist services. Cities are jumping the gap from real-life into the metaverse. The technology has the potential to solve a multitude of logistic and environmental concerns. The advent of the Metaverse—along with Covid travel restrictions and social distancing—has seen an explosion of… Read More »The Evolution of Smart City Tourism The post The Evolution of Smart City Tourism appeared first on Data Science Central.
    Healthcare Robots On the Horizon For Patient Care
    (3 min) The latest trends in robotics, i.e. the integration of robotic technologies with the new and exciting tools and technologies powered by Artificial Intelligence or AI are creating new avenues in both domains. With the emergence of AI-powered robotics, various trailblazers and pioneers in automation are focusing on the adoption and integration of AI technologies in… Read More »Healthcare Robots On the Horizon For Patient Care The post Healthcare Robots On the Horizon For Patient Care appeared first on Data Science Central.
    Neat R Plots with the Cairo Graphics Library
    (4 min) As I spent much of my career in business intelligence and marketing, programming in R was never one of my top skills. While I started programming in R more than 30 years ago, it was never one of my core activities either. Yet I have a background in machine learning and image processing, so I… Read More »Neat R Plots with the Cairo Graphics Library The post Neat R Plots with the Cairo Graphics Library appeared first on Data Science Central.
    4 VR & Online Card Games Gaining Popularity for 2022
    (5 min) As far as we know, the earliest iterations of playing cards began to first appear in Europe during the late 1300’s and early 1400’s, after they were believed to have been imported from the East. Over the generations, these card games have undergone dramatic transformations until they finally arrived at the classic 52 decks four-suit… Read More »4 VR & Online Card Games Gaining Popularity for 2022 The post 4 VR & Online Card Games Gaining Popularity for 2022 appeared first on Data Science Central.
  • Artificial Intelligence

    What are the best AI face editing apps/software you know?
    (0 min) 2 Question: What is the best AI which can turn you into an anime character? submitted by /u/xXNOdrugsForMEXx [link] [comments]
    Most exciting Recent or Upcoming AI Tech?
    (1 min) Im looking for a topic to do a research AI WhitePaper on and I was curious if anyone had any interesting ideas. Does not have to be the most complex thing in the world. Its basically just: the use of an AI / machine learning solution to solve a specific problem in a business. A good example: something like an AI ChatBot replacing HR employee engagement surveys. submitted by /u/kmalternative [link] [comments]
    The Messy History of Facial Recognition Company Clearview AI
    (0 min) submitted by /u/regalalgorithm [link] [comments]
    Their is something about Ai generated architecture.. it always comes out mind blowing
    (1 min) submitted by /u/snowpixelapp [link] [comments]
    I'm looking to join a Project/Research of 3D Machine Learning from abstract 2D Art that makes several outputs of 3Dmodeled Art from a single non-figurative 2D Image.
    (1 min) The applications for this could disrupt Architecture & Sculpture worlds. Which companies/startups/researching teams could be interested in this research PC app project? I'm really interested in being apart of a project like this. Please message me if you want to start. I have more Specs. submitted by /u/Niu_Davinci [link] [comments]
    Isn't this table for perceptron of AND gate wrong for B?Because we use updated bias and not the original bias? Please clear my confusion
    (1 min) submitted by /u/barnyard9 [link] [comments]
    Is Machine Learning Research Oversaturated?
    (0 min) submitted by /u/SlickBlueML [link] [comments]
  • Machine Learning

    [P] Discord Community for game developers using ML models/GPT3 in their games
    (1 min) submitted by /u/22vortex22 [link] [comments]
    [R] AIoT and Federated Learning Platform
    (1 min) Are there any AIoT or Federated learning platforms that are recommended to help write and test/train models for edge devices? Looking for a platform that allows me to write a jupyter notebook, and simulate an environment of N edge devices - each device with unique data. Any other pointers, I'm all ears! submitted by /u/morseky1 [link] [comments]
    [Discussion] A ML-based intelligence model
    (1 min) To my current knowledge, the basis for IQ tests is the g-factor model developped over a century ago by Spearman, using factor analysis. A general intelligence factor has been generally admitted, and a lot of factor analysis studies have been done to estimate such a factor and correlations,etc. More recently, the consensual idea has been of a hierarchical model : the g-factor (with no consensus on its interpretation) at the top, broad abilities below, and specific abilities below. Now, I didn't find any attempt to create a general intelligence model from more modern tools. What about all the non-linear dimensionalily reductions techniques (kernel based, manifold data...) or deep learning models,etc ? submitted by /u/hugoboum [link] [comments]
    [D] Machine Learning - WAYR (What Are You Reading) - Week 129
    (1 min) This is a place to share machine learning research papers, journals, and articles that you're reading this week. If it relates to what you're researching, by all means elaborate and give us your insight, otherwise it could just be an interesting paper you've read. Please try to provide some insight from your understanding and please don't post things which are present in wiki. Preferably you should link the arxiv page (not the PDF, you can easily access the PDF from the summary page but not the other way around) or any other pertinent links. Previous weeks : 1-10 11-20 21-30 31-40 41-50 51-60 61-70 71-80 81-90 91-100 101-110 111-120 121-130 Week 1 Week 11 Week 21 Week 31 Week 41 Week 51 Week 61 Week 71 Week 81 Week 91 Week 101 Week 111 Week 121 Week 2 Week 12 Week 22 Week 32…
    [P] Feature extraction for acoustic signals
    (2 min) Wondering if anyone here has any experience with feature extraction for acoustic signals, specifically broadband noises like gunshots? I’m an ecologist by training and have very little experience with ML but have gone through the literature some and found some methods others have used on acoustic data like extracting mel-frequency cepstrum coefficients. But I am a bit lost on what exactly is being done here if someone is willing to explain this and/or suggest other methods that might be better suited for extracting relevant features from broadband signals. Thank you!!! submitted by /u/Life-Volume135 [link] [comments]
    [R] The Annotated S4: Efficiently Modeling Long Sequences with Structured State Spaces
    (1 min) Blog post: https://srush.github.io/annotated-s4/ Original paper: https://arxiv.org/abs/2111.00396 This is a blog post that introduces a recent model called S4: (Structured State Space for Sequence Modeling). It is somewhat different from most previous architectures such as CNNs, Transformers, etc, and this work seeks to explain the model using code examples and figures. Here's the first paragraph of the post: The Structured State Space for Sequence Modeling (S4) architecture is a new approach to very long-range sequence modeling tasks for vision, language, and audio, showing a capacity to capture dependencies over tens of thousands of steps. Especially impressive are the model’s results on the challenging Long Range Arena benchmark, showing an ability to reason over sequences of up to 16,000+ elements with high accuracy. Rosanne Liu (one of the co-founders of ML Collective) considers the paper as one of the most underrated papers of 2021, and labeled the Annotated S4 blog post as a "must-read" (tweet). submitted by /u/jwuphysics [link] [comments]
    [R] Instant Neural Graphics Primitives with a Multiresolution Hash Encoding (Training a NeRF takes 5 seconds!)
    (3 min) submitted by /u/Illustrious_Row_9971 [link] [comments]
    [D] Simple Questions Thread
    (2 min) Please post your questions here instead of creating a new thread. Encourage others who create new posts for questions to post here instead! Thread will stay alive until next one so keep posting after the date in the title. Thanks to everyone for answering questions in the previous thread! submitted by /u/AutoModerator [link] [comments]
    [P] I'm building a Neural Search Plugin for Elastic/Opensearch
    (2 min) Hey Everyone, I'm building a plugin for end-to-end neural search (eg SBERT, Hugging Face, Doc2Vec etc) in Opensearch and I'd love to hear any suggestions from the community of what you might find useful or about issues you've had with Opensearch/neural search in the past. Despite the Elasticsearch website claiming in many places that you can do "machine learning" with Elasticsearch, I've found that it's not straight forward at all to use neural search algos with ES/Opensearch. In most cases (putting to one side some specific cases like anomaly detection), you have to implement the ML algorithm yourself and you only get to use ES as a storage layer. Some 3rd party plug and play frameworks that support ES seem quite promising but also lack functionality in terms of data retrieval - for exa…
    [D] Best Object Segmentation Repo with Temporal Dependency?
    (1 min) submitted by /u/ZeApelido [link] [comments]
    [P] Ornette: ML-based Music Generation System
    (1 min) Hi all. I'm developing a TUI tool for continuous music generation with Machine Learning. You can pick a machine learning model and generate sound with it via SuperCollider, or save its output as MIDI and render the audio yourself later. You can also load some MIDI clip to use it as a primer. I'll be happy to hear opinions and answer questions. Cheers video demo github submitted by /u/ghalestrilo [link] [comments]
    [D] Generic version of Numerai (with non-trading challenges)?
    (0 min) Does anyone know if there exists a generic version of Numerai? What I mean is a platform where, just like on Numerai, you can get paid as a data scientist for solving challenges, but where challenges are more broad and not just trading? submitted by /u/xepo3abp [link] [comments]
    [P] Open-source tool for building NLP training sets with weak supervision and search queries
    (1 min) submitted by /u/dvilasuero [link] [comments]
    [D] What are the popular AI/ML Conferences?
    (2 min) We created a page that organizes papers from popular conferences. Which conferences have we missed? Link: https://papers.labml.ai/conferences We collected papers from ICLR, NEURIPS, ICCV, CVPR, ICML, IJCAI, EMNLP, ACL and AAAI. ICLR 2022 has papers from the review stage. PS: Since most AI paper preprints are readily available in Arxiv now, what effect do you think it has if the paper is published in a popular conference? submitted by /u/hnipun [link] [comments]
    [R] Consistent Style Transfer
    (0 min) submitted by /u/Illustrious_Row_9971 [link] [comments]
    [P] Getting 3D sneaker models from a single 2D sideview
    (1 min) submitted by /u/InfamousPancakes [link] [comments]
    [D] Why isn't crowdsourced ML more popular?
    (3 min) I'm wondering why projects like OpenML haven't really taken off. The idea behind it is that people in science who have datasets they'd like modeled can upload their data, set tasks or questions they want to answer, and then let people who want to hone their ML skills attempt to create predictive models from the data. It was launched in 2014, but hasn't really gained any traction in the ML and data science world. Do people here have any thoughts on why crowdsourced ML hasn't caught on? submitted by /u/EducationalCicada [link] [comments]
  • Neural Networks, Deep Learning and Machine Learning

    Wine Quality Prediction with Sequential and Decision Tree models | Kolbenkraft
    (0 min) submitted by /u/cjmodi306 [link] [comments]
    I'm building a plugin for neural search in ElasticSearch/Opensearch
    (1 min) Hey Everyone, I'm building a plugin for end-to-end neural search (eg SBERT, Hugging Face, Doc2Vec etc) in Opensearch and it would be great to hear any suggestions from the NLP community of what you might find useful or about issues you've had with Opensearch/neural search in the past. Ideally, the plugin would allow the user to configure/train an algorithm and then it would allow them to work entirely with text, using the vector representations in the background. I'm also exploring extending this and allowing users to refresh/update these embeddings to continually improve them. Let me know what you think, open to any suggestions! If you want to keep up to date with this, here is a google form https://forms.gle/acmGTK1gPkPZbVJm8 submitted by /u/everythingserverless [link] [comments]
    Isn't this table for perceptron of AND gate wrong for B?Because we use updated bias and not the original bias? Please clear my confusion
    (1 min) submitted by /u/barnyard9 [link] [comments]
  • Reinforcement Learning

    Are multi agent or self-play environments always automatically POMDPs?
    (1 min) Let’s say we look at the game of Atari Pong. In Deepminds Atari paper from 2015, the state was represented by a stack of 4 frames, so it would have the Markov property regarding the ball speed and direction. I’m having a similar environment to Pong and wanted to perform Self-Play to enhance the wall clock training time pf my agent. I considered Pong with a state representation as described above as an MDP. But with another agent, I’m not so sure. I mean the strategies and intentions of the enemy are not encoded in the observation. Does this make the example a POMDP? And wouldn’t this mean that most of multi agent environments would be POMDP? Even though the game is a perfect information game? submitted by /u/Educational_Dot_3024 [link] [comments]
    Clarification regarding advantage calculation in PPO
    (1 min) Hello :D, Regarding the advantage calculation (in GAE for example), what happens if multiple episodes finish within the same batch. I've seen works that mask out everything after an episode terminates, but im not sure if they keep all finished episodes or keep only one finished episode, in case the batch has multiple finished episodes. submitted by /u/AhmedNizam_ [link] [comments]
    Please give me a tip on training InverseRL (like GAIL)
    (1 min) Hi, I'm trying to train some agents in POMDP tabular environment using expert trajectories. (I know it's not the best way to solve this environment.) But, In my case, not only agent but also discriminator gets limited observation. First, behavior cloning showed acceptable performances. But, GAIL with PPO(+cross entropy term) showed very unstable learning. The agent got high value, but cumulated real rewards were so small. I tried to change network capacity I tried to change activation function to tanh I tried to change agent's learning rate(smaller than before) Please give me a tip for training GAIL discriminator! Very Thanks! (train result image is not uploaded. so plz click this link) ​ submitted by /u/Spiritual_Fig3632 [link] [comments]

2022-01-15

  • cs.LG updates on arXiv.org

    An adaptable cognitive microcontroller node for fitness activity recognition. (arXiv:2201.05110v1 [cs.LG])
    (2 min) The new generation of wireless technologies, fitness trackers, and devices with embedded sensors can have a big impact on healthcare systems and quality of life. Among the most crucial aspects to consider in these devices are the accuracy of the data produced and power consumption. Many of the events that can be monitored, while apparently simple, may not be easily detectable and recognizable by devices equipped with embedded sensors, especially on devices with low computing capabilities. It is well known that deep learning reduces the study of features that contribute to the recognition of the different target classes. In this work, we present a portable and battery-powered microcontroller-based device applicable to a wobble board. Wobble boards are low-cost equipment that can be used for sensorimotor training to avoid ankle injuries or as part of the rehabilitation process after an injury. The exercise recognition process was implemented through the use of cognitive techniques based on deep learning. To reduce power consumption, we add an adaptivity layer that dynamically manages the device's hardware and software configuration to adapt it to the required operating mode at runtime. Our experimental results show that adjusting the node configuration to the workload at runtime can save up to 60% of the power consumed. On a custom dataset, our optimized and quantized neural network achieves an accuracy value greater than 97% for detecting some specific physical exercises on a wobble board.
    Outcome-Driven Dynamic Refugee Assignment with Allocation Balancing. (arXiv:2007.03069v4 [math.OC] UPDATED)
    (2 min) This study proposes two new dynamic assignment algorithms to match refugees and asylum seekers to geographic localities within a host country. The first, currently implemented in a multi-year pilot in Switzerland, seeks to maximize the average predicted employment level (or any measured outcome of interest) of refugees through a minimum-discord online assignment algorithm. Although the proposed algorithm achieves near-optimal expected employment compared to the hindsight-optimal solution (and improves upon the status quo procedure by about 40%), it results in a periodically imbalanced allocation to the localities over time. This leads to undesirable workload inefficiencies for resettlement resources and agents. To address this problem, the second algorithm balances the goal of improving refugee outcomes with the desire for an even allocation over time. The performance of the proposed methods is illustrated using real refugee resettlement data from a large resettlement agency in the United States. On this dataset, we find that the allocation balancing algorithm can achieve near-perfect balance over time with only a small loss in expected employment compared to the pure employment-maximizing algorithm. In addition, the allocation balancing algorithm offers a number of ancillary benefits compared to pure outcome-maximization, including robustness to unknown arrival flows and greater exploration.
    DiffSinger: Singing Voice Synthesis via Shallow Diffusion Mechanism. (arXiv:2105.02446v5 [eess.AS] UPDATED)
    (2 min) Singing voice synthesis (SVS) systems are built to synthesize high-quality and expressive singing voice, in which the acoustic model generates the acoustic features (e.g., mel-spectrogram) given a music score. Previous singing acoustic models adopt a simple loss (e.g., L1 and L2) or generative adversarial network (GAN) to reconstruct the acoustic features, while they suffer from over-smoothing and unstable training issues respectively, which hinder the naturalness of synthesized singing. In this work, we propose DiffSinger, an acoustic model for SVS based on the diffusion probabilistic model. DiffSinger is a parameterized Markov chain that iteratively converts the noise into mel-spectrogram conditioned on the music score. By implicitly optimizing variational bound, DiffSinger can be stably trained and generate realistic outputs. To further improve the voice quality and speed up inference, we introduce a shallow diffusion mechanism to make better use of the prior knowledge learned by the simple loss. Specifically, DiffSinger starts generation at a shallow step smaller than the total number of diffusion steps, according to the intersection of the diffusion trajectories of the ground-truth mel-spectrogram and the one predicted by a simple mel-spectrogram decoder. Besides, we propose boundary prediction methods to locate the intersection and determine the shallow step adaptively. The evaluations conducted on a Chinese singing dataset demonstrate that DiffSinger outperforms state-of-the-art SVS work. Extensional experiments also prove the generalization of our methods on text-to-speech task (DiffSpeech). Audio samples: https://diffsinger.github.io. Codes: https://github.com/MoonInTheRiver/DiffSinger.
    Learning to Break Deep Perceptual Hashing: The Use Case NeuralHash. (arXiv:2111.06628v3 [cs.LG] UPDATED)
    (2 min) Apple recently revealed its deep perceptual hashing system NeuralHash to detect child sexual abuse material (CSAM) on user devices before files are uploaded to its iCloud service. Public criticism quickly arose regarding the protection of user privacy and the system's reliability. In this paper, we present the first comprehensive empirical analysis of deep perceptual hashing based on NeuralHash. Specifically, we show that current deep perceptual hashing may not be robust. An adversary can manipulate the hash values by applying slight changes in images, either induced by gradient-based approaches or simply by performing standard image transformations, forcing or preventing hash collisions. Such attacks permit malicious actors easily to exploit the detection system: from hiding abusive material to framing innocent users, everything is possible. Moreover, using the hash values, inferences can still be made about the data stored on user devices. In our view, based on our results, deep perceptual hashing in its current form is generally not ready for robust client-side scanning and should not be used from a privacy perspective.
    Deep clustering with fusion autoencoder. (arXiv:2201.04727v1 [cs.LG])
    (2 min) Embracing the deep learning techniques for representation learning in clustering research has attracted broad attention in recent years, yielding a newly developed clustering paradigm, viz. the deep clustering (DC). Typically, the DC models capitalize on autoencoders to learn the intrinsic features which facilitate the clustering process in consequence. Nowadays, a generative model named variational autoencoder (VAE) has got wide acceptance in DC studies. Nevertheless, the plain VAE is insufficient to perceive the comprehensive latent features, leading to the deteriorative clustering performance. In this paper, a novel DC method is proposed to address this issue. Specifically, the generative adversarial network and VAE are coalesced into a new autoencoder called fusion autoencoder (FAE) for discerning more discriminative representation that benefits the downstream clustering task. Besides, the FAE is implemented with the deep residual network architecture which further enhances the representation learning ability. Finally, the latent space of the FAE is transformed to an embedding space shaped by a deep dense neural network for pulling away different clusters from each other and collapsing data points within individual clusters. Experiment conducted on several image datasets demonstrate the effectiveness of the proposed DC model against the baseline methods.
    Meta-Learning Guarantees for Online Receding Horizon Learning Control. (arXiv:2010.11327v14 [eess.SY] UPDATED)
    (2 min) In this paper we provide provable regret guarantees for an online meta-learning receding horizon control algorithm in an iterative control setting. We consider the setting where, in each iteration the system to be controlled is a linear deterministic system that is different and unknown, the cost for the controller in an iteration is a general additive cost function and there are affine control input constraints. By analysing conditions under which sub-linear regret is achievable, we prove that the meta-learning online receding horizon controller achieves an average of the dynamic regret for the controller cost that is $\tilde{O}((1+1/\sqrt{N})T^{3/4})$ with the number of iterations $N$. Thus, we show that the worst regret for learning within an iteration improves with experience of more iterations, with guarantee on rate of improvement.
    Modeling Implicit Bias with Fuzzy Cognitive Maps. (arXiv:2112.12713v2 [cs.LG] UPDATED)
    (2 min) This paper presents a Fuzzy Cognitive Map model to quantify implicit bias in structured datasets where features can be numeric or discrete. In our proposal, problem features are mapped to neural concepts that are initially activated by experts when running what-if simulations, whereas weights connecting the neural concepts represent absolute correlation/association patterns between features. In addition, we introduce a new reasoning mechanism equipped with a normalization-like transfer function that prevents neurons from saturating. Another advantage of this new reasoning mechanism is that it can easily be controlled by regulating nonlinearity when updating neurons' activation values in each iteration. Finally, we study the convergence of our model and derive analytical conditions concerning the existence and unicity of fixed-point attractors.
    Auto IV: Counterfactual Prediction via Automatic Instrumental Variable Decomposition. (arXiv:2107.05884v2 [cs.LG] UPDATED)
    (2 min) Instrumental variables (IVs), sources of treatment randomization that are conditionally independent of the outcome, play an important role in causal inference with unobserved confounders. However, the existing IV-based counterfactual prediction methods need well-predefined IVs, while it is an art rather than science to find valid IVs in many real-world scenes. Moreover, the predefined hand-made IVs could be weak or erroneous by violating the conditions of valid IVs. These thorny facts hinder the application of the IV-based counterfactual prediction methods. In this paper, we propose a novel Automatic Instrumental Variable decomposition (AutoIV) algorithm to automatically generate representations serving the role of IVs from observed variables (IV candidates). Specifically, we let the learned IV representations satisfy the relevance condition with the treatment and exclusion condition with the outcome via mutual information maximization and minimization constraints, respectively. We also learn confounder representations by encouraging them to be relevant to both the treatment and the outcome. The IV and confounder representations compete for the information with their constraints in an adversarial game, which allows us to get valid IV representations for IV-based counterfactual prediction. Extensive experiments demonstrate that our method generates valid IV representations for accurate IV-based counterfactual prediction.
    Flexible Networks for Learning Physical Dynamics of Deformable Objects. (arXiv:2112.03728v2 [cs.CV] UPDATED)
    (2 min) Learning the physical dynamics of deformable objects with particle-based representation has been the objective of many computational models in machine learning. While several state-of-the-art models have achieved this objective in simulated environments, most existing models impose a precondition, such that the input is a sequence of ordered point sets. That is, the order of the points in each point set must be the same across the entire input sequence. This precondition restrains the model from generalizing to real-world data, which is considered to be a sequence of unordered point sets. In this paper, we propose a model named time-wise PointNet (TP-Net) that solves this problem by directly consuming a sequence of unordered point sets to infer the future state of a deformable object with particle-based representation. Our model consists of a shared feature extractor that extracts global features from each input point set in parallel and a prediction network that aggregates and reasons on these features for future prediction. The key concept of our approach is that we use global features rather than local features to achieve invariance to input permutations and ensure the stability and scalability of our model. Experiments demonstrate that our model achieves state-of-the-art performance with real-time prediction speed in both synthetic dataset and real-world dataset. In addition, we provide quantitative and qualitative analysis on why our approach is more effective and efficient than existing approaches.
    DebiasedDTA: Model Debiasing to Boost Drug-Target Affinity Prediction. (arXiv:2107.05556v3 [q-bio.QM] UPDATED)
    (2 min) Motivation: Computational models that accurately identify high-affinity protein-chemical pairs can accelerate drug discovery pipelines. These models, trained on available protein-chemical interaction datasets, can be used to predict the binding affinity of an input protein-chemical pair. However, the training datasets may contain surface patterns, called dataset biases, which cause models to memorize dataset-specific biomolecule properties, instead of learning binding mechanisms. As a result, the prediction performance of models drops for unseen biomolecules. Here, we present DebiasedDTA, a novel drug-target affinity (DTA) prediction model training framework that addresses dataset biases to improve affinity prediction for novel biomolecules. DebiasedDTA uses ensemble learning and sample weight adaptation to identify and avoid biases and is applicable to most DTA prediction models. Results: The results show that DebiasedDTA can boost models while predicting the interactions between unseen biomolecules. In addition, prediction performance for seen biomolecules also improves. The experiments also show that DebiasedDTA can augment DTA prediction models of different input and model structures and is able to avoid biases of different sources. The investigations of predictions reveal that model debiasing can diminish the importance of misleading features and can enable models to learn more from the proteins. DebiasedDTA is published as an open-source python package to enable debiasing custom DTA prediction models with only two lines of code.
    DS-Sync: Addressing Network Bottlenecks with Divide-and-Shuffle Synchronization for Distributed DNN Training. (arXiv:2007.03298v2 [cs.LG] UPDATED)
    (2 min) Bulk synchronous parallel (BSP) is the de-facto paradigm for distributed DNN training in today's production clusters. However, due to the global synchronization nature, its performance can be significantly influenced by network bottlenecks caused by either static topology heterogeneity or dynamic bandwidth contentions. Existing solutions, either system-level optimizations strengthening BSP (e.g., Ring or Hierarchical All-reduce) or algorithmic optimizations replacing BSP (e.g., ASP or SSP, which relax the global barriers), do not completely solve the problem, as they may still suffer from communication inefficiency or risk convergence inaccuracy. In this paper, we present a novel divide-and-shuffle synchronization (DS-Sync) to realize communication efficiency without sacrificing convergence accuracy for distributed DNN training. At its heart, by taking into account the network bottlenecks, DS-Sync improves communication efficiency by dividing workers into non-overlap groups to synchronize independently in a bottleneck-free manner. Meanwhile, it maintains convergence accuracy by iteratively shuffling workers among different groups to ensure a global consensus. We theoretically prove that DS-Sync converges properly in non-convex and smooth conditions like DNN. We further implement DS-Sync and integrate it with PyTorch, and our testbed experiments show that DS-Sync can achieve up to $94\%$ improvements on the end-to-end training time with existing solutions while maintaining the same accuracy.
    Multimodal Representation for Neural Code Search. (arXiv:2107.00992v3 [cs.SE] UPDATED)
    (2 min) Semantic code search is about finding semantically relevant code snippets for a given natural language query. In the state-of-the-art approaches, the semantic similarity between code and query is quantified as the distance of their representation in the shared vector space. In this paper, to improve the vector space, we introduce tree-serialization methods on a simplified form of AST and build the multimodal representation for the code data. We conduct extensive experiments using a single corpus that is large-scale and multi-language: CodeSearchNet. Our results show that both our tree-serialized representations and multimodal learning model improve the performance of code search. Last, we define intuitive quantification metrics oriented to the completeness of semantic and syntactic information of the code data, to help understand the experimental findings.
    NURBS-Diff: A Differentiable Programming Module for NURBS. (arXiv:2104.14547v4 [cs.LG] UPDATED)
    (2 min) Boundary representations (B-reps) using Non-Uniform Rational B-splines (NURBS) are the de facto standard used in CAD, but their utility in deep learning-based approaches is not well researched. We propose a differentiable NURBS module to integrate NURBS representations of CAD models with deep learning methods. We mathematically define the derivatives of the NURBS curves or surfaces with respect to the input parameters (control points, weights, and the knot vector). These derivatives are used to define an approximate Jacobian used for performing the "backward" evaluation to train the deep learning models. We have implemented our NURBS module using GPU-accelerated algorithms and integrated it with PyTorch, a popular deep learning framework. We demonstrate the efficacy of our NURBS module in performing CAD operations such as curve or surface fitting and surface offsetting. Further, we show its utility in deep learning for unsupervised point cloud reconstruction and enforce analysis constraints. These examples show that our module performs better for certain deep learning frameworks and can be directly integrated with any deep-learning framework requiring NURBS.
    SynthBio: A Case Study in Human-AI Collaborative Curation of Text Datasets. (arXiv:2111.06467v2 [cs.CL] UPDATED)
    (2 min) NLP researchers need more, higher-quality text datasets. Human-labeled datasets are expensive to collect, while datasets collected via automatic retrieval from the web such as WikiBio are noisy and can include undesired biases. Moreover, data sourced from the web is often included in datasets used to pretrain models, leading to inadvertent cross-contamination of training and test sets. In this work we introduce a novel method for efficient dataset curation: we use a large language model to provide seed generations to human raters, thereby changing dataset authoring from a writing task to an editing task. We use our method to curate SynthBio - a new evaluation set for WikiBio - composed of structured attribute lists describing fictional individuals, mapped to natural language biographies. We show that our dataset of fictional biographies is less noisy than WikiBio, and also more balanced with respect to gender and nationality.
    A Comparative Study on Basic Elements of Deep Learning Models for Spatial-Temporal Traffic Forecasting. (arXiv:2111.07513v2 [cs.LG] UPDATED)
    (2 min) Traffic forecasting plays a crucial role in intelligent transportation systems. The spatial-temporal complexities in transportation networks make the problem especially challenging. The recently suggested deep learning models share basic elements such as graph convolution, graph attention, recurrent units, and/or attention mechanism. In this study, we designed an in-depth comparative study for four deep neural network models utilizing different basic elements. For base models, one RNN-based model and one attention-based model were chosen from previous literature. Then, the spatial feature extraction layers in the models were substituted with graph convolution and graph attention. To analyze the performance of each element in various environments, we conducted experiments on four real-world datasets - highway speed, highway flow, urban speed from a homogeneous road link network, and urban speed from a heterogeneous road link network. The results demonstrate that the RNN-based model and the attention-based model show a similar level of performance for short-term prediction, and the attention-based model outperforms the RNN in longer-term predictions. The choice of graph convolution and graph attention makes a larger difference in the RNN-based models. Also, our modified version of GMAN shows comparable performance with the original with less memory consumption.
    On component interactions in two-stage recommender systems. (arXiv:2106.14979v3 [cs.IR] UPDATED)
    (2 min) Thanks to their scalability, two-stage recommenders are used by many of today's largest online platforms, including YouTube, LinkedIn, and Pinterest. These systems produce recommendations in two steps: (i) multiple nominators, tuned for low prediction latency, preselect a small subset of candidates from the whole item pool; (ii) a slower but more accurate ranker further narrows down the nominated items, and serves to the user. Despite their popularity, the literature on two-stage recommenders is relatively scarce, and the algorithms are often treated as mere sums of their parts. Such treatment presupposes that the two-stage performance is explained by the behavior of the individual components in isolation. This is not the case: using synthetic and real-world data, we demonstrate that interactions between the ranker and the nominators substantially affect the overall performance. Motivated by these findings, we derive a generalization lower bound which shows that independent nominator training can lead to performance on par with uniformly random recommendations. We find that careful design of item pools, each assigned to a different nominator, alleviates these issues. As manual search for a good pool allocation is difficult, we propose to learn one instead using a Mixture-of-Experts based approach. This significantly improves both precision and recall at K.
    A Concentration Bound for LSPE($\lambda$). (arXiv:2111.02644v2 [cs.LG] UPDATED)
    (2 min) The popular LSPE($\lambda$) algorithm for policy evaluation is revisited to derive a concentration bound that gives high probability performance guarantees from some time on.
    Discovering Governing Equations from Partial Measurements with Deep Delay Autoencoders. (arXiv:2201.05136v1 [cs.LG])
    (2 min) A central challenge in data-driven model discovery is the presence of hidden, or latent, variables that are not directly measured but are dynamically important. Takens' theorem provides conditions for when it is possible to augment these partial measurements with time delayed information, resulting in an attractor that is diffeomorphic to that of the original full-state system. However, the coordinate transformation back to the original attractor is typically unknown, and learning the dynamics in the embedding space has remained an open challenge for decades. Here, we design a custom deep autoencoder network to learn a coordinate transformation from the delay embedded space into a new space where it is possible to represent the dynamics in a sparse, closed form. We demonstrate this approach on the Lorenz, R\"ossler, and Lotka-Volterra systems, learning dynamics from a single measurement variable. As a challenging example, we learn a Lorenz analogue from a single scalar variable extracted from a video of a chaotic waterwheel experiment. The resulting modeling framework combines deep learning to uncover effective coordinates and the sparse identification of nonlinear dynamics (SINDy) for interpretable modeling. Thus, we show that it is possible to simultaneously learn a closed-form model and the associated coordinate system for partially observed dynamics.
    Periodic Freight Demand Estimation for Large-scale Tactical Planning. (arXiv:2105.09136v3 [cs.LG] UPDATED)
    (2 min) Freight carriers rely on tactical planning to design their service network to satisfy demand in a cost-effective way. For computational tractability, deterministic and cyclic Service Network Design (SND) formulations are used to solve large-scale problems. A central input is the periodic demand, that is, the demand expected to repeat in every period in the planning horizon. In practice, demand is predicted by a time series forecasting model and the periodic demand is the average of those forecasts. This is, however, only one of many possible mappings. The problem consisting in selecting this mapping has hitherto been overlooked in the literature. We propose to use the structure of the downstream decision-making problem to select a good mapping. For this purpose, we introduce a multilevel mathematical programming formulation that explicitly links the time series forecasts to the SND problem of interest. The solution is a periodic demand estimate that minimizes costs over the tactical planning horizon. We report results in an extensive empirical study of a large-scale application from the Canadian National Railway Company. They clearly show the importance of the periodic demand estimation problem. Indeed, the planning costs exhibit an important variation over different periodic demand estimates and using an estimate different from the mean forecast can lead to substantial cost reductions. Moreover, the costs associated with the periodic demand estimates based on forecasts were comparable to, or even better than those obtained using the mean of actual demand.
    Generalized Kernel Ridge Regression for Long Term Causal Inference: Treatment Effects, Dose Responses, and Counterfactual Distributions. (arXiv:2201.05139v1 [econ.EM])
    (2 min) I propose kernel ridge regression estimators for long term causal inference, where a short term experimental data set containing randomized treatment and short term surrogates is fused with a long term observational data set containing short term surrogates and long term outcomes. I propose estimators of treatment effects, dose responses, and counterfactual distributions with closed form solutions in terms of kernel matrix operations. I allow covariates, treatment, and surrogates to be discrete or continuous, and low, high, or infinite dimensional. For long term treatment effects, I prove $\sqrt{n}$ consistency, Gaussian approximation, and semiparametric efficiency. For long term dose responses, I prove uniform consistency with finite sample rates. For long term counterfactual distributions, I prove convergence in distribution.
    Implicit Bias of MSE Gradient Optimization in Underparameterized Neural Networks. (arXiv:2201.04738v1 [stat.ML])
    (2 min) We study the dynamics of a neural network in function space when optimizing the mean squared error via gradient flow. We show that in the underparameterized regime the network learns eigenfunctions of an integral operator $T_{K^\infty}$ determined by the Neural Tangent Kernel (NTK) at rates corresponding to their eigenvalues. For example, for uniformly distributed data on the sphere $S^{d - 1}$ and rotation invariant weight distributions, the eigenfunctions of $T_{K^\infty}$ are the spherical harmonics. Our results can be understood as describing a spectral bias in the underparameterized regime. The proofs use the concept of "Damped Deviations", where deviations of the NTK matter less for eigendirections with large eigenvalues due to the occurence of a damping factor. Aside from the underparameterized regime, the damped deviations point-of-view can be used to track the dynamics of the empirical risk in the overparameterized setting, allowing us to extend certain results in the literature. We conclude that damped deviations offers a simple and unifying perspective of the dynamics when optimizing the squared error.
    Deep Learning-based Extreme Heatwave Forecast. (arXiv:2103.09743v3 [cs.LG] UPDATED)
    (2 min) Because of the impact of extreme heat waves and heat domes on society and biodiversity, their study is a key challenge. We specifically study long-lasting extreme heat waves, which are among the most important for climate impacts. Physics driven weather forecast systems or climate models can be used to forecast their occurrence or predict their probability. The present work explores the use of deep learning architectures, trained using outputs of a climate model, as an alternative strategy to forecast the occurrence of extreme long-lasting heatwaves. This new approach will be useful for several key scientific goals which include the study of climate model statistics, building a quantitative proxy for resampling rare events in climate models, study the impact of climate change, and should eventually be useful for forecasting. Fulfilling these important goals implies addressing issues such as class-size imbalance that is intrinsically associated with rare event prediction, assessing the potential benefits of transfer learning to address the nested nature of extreme events (naturally included in less extreme ones). We train a Convolutional Neural Network, using 1000 years of climate model outputs, with large-class undersampling and transfer learning. From the observed snapshots of the surface temperature and the 500 hPa geopotential height fields, the trained network achieves significant performance in forecasting the occurrence of long-lasting extreme heatwaves. We are able to predict them at three different levels of intensity, and as early as 15 days ahead of the start of the event (30 days ahead of the end of the event).
    Symmetric Sparse Boolean Matrix Factorization and Applications. (arXiv:2102.01570v3 [cs.LG] UPDATED)
    (2 min) In this work, we study a variant of nonnegative matrix factorization where we wish to find a symmetric factorization of a given input matrix into a sparse, Boolean matrix. Formally speaking, given $\mathbf{M}\in\mathbb{Z}^{m\times m}$, we want to find $\mathbf{W}\in\{0,1\}^{m\times r}$ such that $\| \mathbf{M} - \mathbf{W}\mathbf{W}^\top \|_0$ is minimized among all $\mathbf{W}$ for which each row is $k$-sparse. This question turns out to be closely related to a number of questions like recovering a hypergraph from its line graph, as well as reconstruction attacks for private neural network training. As this problem is hard in the worst-case, we study a natural average-case variant that arises in the context of these reconstruction attacks: $\mathbf{M} = \mathbf{W}\mathbf{W}^{\top}$ for $\mathbf{W}$ a random Boolean matrix with $k$-sparse rows, and the goal is to recover $\mathbf{W}$ up to column permutation. Equivalently, this can be thought of as recovering a uniformly random $k$-uniform hypergraph from its line graph. Our main result is a polynomial-time algorithm for this problem based on bootstrapping higher-order information about $\mathbf{W}$ and then decomposing an appropriate tensor. The key ingredient in our analysis, which may be of independent interest, is to show that such a matrix $\mathbf{W}$ has full column rank with high probability as soon as $m = \widetilde{\Omega}(r)$, which we do using tools from Littlewood-Offord theory and estimates for binary Krawtchouk polynomials.
    Improved Multi-objective Data Stream Clustering with Time and Memory Optimization. (arXiv:2201.05079v1 [cs.LG])
    (2 min) The analysis of data streams has received considerable attention over the past few decades due to sensors, social media, etc. It aims to recognize patterns in an unordered, infinite, and evolving stream of observations. Clustering this type of data requires some restrictions in time and memory. This paper introduces a new data stream clustering method (IMOC-Stream). This method, unlike the other clustering algorithms, uses two different objective functions to capture different aspects of the data. The goal of IMOC-Stream is to: 1) reduce computation time by using idle times to apply genetic operations and enhance the solution. 2) reduce memory allocation by introducing a new tree synopsis. 3) find arbitrarily shaped clusters by using a multi-objective framework. We conducted an experimental study with high dimensional stream datasets and compared them to well-known stream clustering techniques. The experiments show the ability of our method to partition the data stream in arbitrarily shaped, compact, and well-separated clusters while optimizing the time and memory. Our method also outperformed most of the stream algorithms in terms of NMI and ARAND measures.
    Generalized Shape Metrics on Neural Representations. (arXiv:2110.14739v2 [stat.ML] UPDATED)
    (2 min) Understanding the operation of biological and artificial networks remains a difficult and important challenge. To identify general principles, researchers are increasingly interested in surveying large collections of networks that are trained on, or biologically adapted to, similar tasks. A standardized set of analysis tools is now needed to identify how network-level covariates -- such as architecture, anatomical brain region, and model organism -- impact neural representations (hidden layer activations). Here, we provide a rigorous foundation for these analyses by defining a broad family of metric spaces that quantify representational dissimilarity. Using this framework we modify existing representational similarity measures based on canonical correlation analysis to satisfy the triangle inequality, formulate a novel metric that respects the inductive biases in convolutional layers, and identify approximate Euclidean embeddings that enable network representations to be incorporated into essentially any off-the-shelf machine learning method. We demonstrate these methods on large-scale datasets from biology (Allen Institute Brain Observatory) and deep learning (NAS-Bench-101). In doing so, we identify relationships between neural representations that are interpretable in terms of anatomical features and model performance.
    Neural Koopman Lyapunov Control. (arXiv:2201.05098v1 [eess.SY])
    (2 min) Learning and synthesizing stabilizing controllers for unknown nonlinear systems is a challenging problem for real-world and industrial applications. Koopman operator theory allow one to analyze nonlinear systems through the lens of linear systems and nonlinear control systems through the lens of bilinear control systems. The key idea of these methods, lies in the transformation of the coordinates of the nonlinear system into the Koopman observables, which are coordinates that allow the representation of the original system (control system) as a higher dimensional linear (bilinear control) system. However, for nonlinear control systems, the bilinear control model obtained by applying Koopman operator based learning methods is not necessarily stabilizable and therefore, the existence of a stabilizing feedback control is not guaranteed which is crucial for many real world applications. Simultaneous identification of these stabilizable Koopman based bilinear control systems as well as the associated Koopman observables is still an open problem. In this paper, we propose a framework to identify and construct these stabilizable bilinear models and its associated observables from data by simultaneously learning a bilinear Koopman embedding for the underlying unknown nonlinear control system as well as a Control Lyapunov Function (CLF) for the Koopman based bilinear model using a learner and falsifier. Our proposed approach thereby provides provable guarantees of global asymptotic stability for the nonlinear control systems with unknown dynamics. Numerical simulations are provided to validate the efficacy of our proposed class of stabilizing feedback controllers for unknown nonlinear systems.
    The curse of overparametrization in adversarial training: Precise analysis of robust generalization for random features regression. (arXiv:2201.05149v1 [cs.LG])
    (2 min) Successful deep learning models often involve training neural network architectures that contain more parameters than the number of training samples. Such overparametrized models have been extensively studied in recent years, and the virtues of overparametrization have been established from both the statistical perspective, via the double-descent phenomenon, and the computational perspective via the structural properties of the optimization landscape. Despite the remarkable success of deep learning architectures in the overparametrized regime, it is also well known that these models are highly vulnerable to small adversarial perturbations in their inputs. Even when adversarially trained, their performance on perturbed inputs (robust generalization) is considerably worse than their best attainable performance on benign inputs (standard generalization). It is thus imperative to understand how overparametrization fundamentally affects robustness. In this paper, we will provide a precise characterization of the role of overparametrization on robustness by focusing on random features regression models (two-layer neural networks with random first layer weights). We consider a regime where the sample size, the input dimension and the number of parameters grow in proportion to each other, and derive an asymptotically exact formula for the robust generalization error when the model is adversarially trained. Our developed theory reveals the nontrivial effect of overparametrization on robustness and indicates that for adversarially trained random features models, high overparametrization can hurt robust generalization.
    Quasi-Framelets: Another Improvement to GraphNeural Networks. (arXiv:2201.04728v1 [cs.LG])
    (2 min) This paper aims to provide a novel design of a multiscale framelets convolution for spectral graph neural networks. In the spectral paradigm, spectral GNNs improve graph learning task performance via proposing various spectral filters in spectral domain to capture both global and local graph structure information. Although the existing spectral approaches show superior performance in some graphs, they suffer from lack of flexibility and being fragile when graph information are incomplete or perturbated. Our new framelets convolution incorporates the filtering func-tions directly designed in the spectral domain to overcome these limitations. The proposed convolution shows a great flexibility in cutting-off spectral information and effectively mitigate the negative effect of noisy graph signals. Besides, to exploit the heterogeneity in real-world graph data, the heterogeneous graph neural network with our new framelet convolution provides a solution for embedding the intrinsic topological information of meta-path with a multi-level graph analysis.Extensive experiments have been conducted on real-world heterogeneous graphs and homogeneous graphs under settings with noisy node features and superior performance results are achieved.
    Fish sounds: towards the evaluation of marine acoustic biodiversity through data-driven audio source separation. (arXiv:2201.05013v1 [cs.SD])
    (2 min) The marine ecosystem is changing at an alarming rate, exhibiting biodiversity loss and the migration of tropical species to temperate basins. Monitoring the underwater environments and their inhabitants is of fundamental importance to understand the evolution of these systems and implement safeguard policies. However, assessing and tracking biodiversity is often a complex task, especially in large and uncontrolled environments, such as the oceans. One of the most popular and effective methods for monitoring marine biodiversity is passive acoustics monitoring (PAM), which employs hydrophones to capture underwater sound. Many aquatic animals produce sounds characteristic of their own species; these signals travel efficiently underwater and can be detected even at great distances. Furthermore, modern technologies are becoming more and more convenient and precise, allowing for very accurate and careful data acquisition. To date, audio captured with PAM devices is frequently manually processed by marine biologists and interpreted with traditional signal processing techniques for the detection of animal vocalizations. This is a challenging task, as PAM recordings are often over long periods of time. Moreover, one of the causes of biodiversity loss is sound pollution; in data obtained from regions with loud anthropic noise, it is hard to separate the artificial from the fish sound manually. Nowadays, machine learning and, in particular, deep learning represents the state of the art for processing audio signals. Specifically, sound separation networks are able to identify and separate human voices and musical instruments. In this work, we show that the same techniques can be successfully used to automatically extract fish vocalizations in PAM recordings, opening up the possibility for biodiversity monitoring at a large scale.
    AequeVox: Automated Fairness Testing of Speech Recognition Systems. (arXiv:2110.09843v2 [cs.LG] UPDATED)
    (2 min) Automatic Speech Recognition (ASR) systems have become ubiquitous. They can be found in a variety of form factors and are increasingly important in our daily lives. As such, ensuring that these systems are equitable to different subgroups of the population is crucial. In this paper, we introduce, AequeVox, an automated testing framework for evaluating the fairness of ASR systems. AequeVox simulates different environments to assess the effectiveness of ASR systems for different populations. In addition, we investigate whether the chosen simulations are comprehensible to humans. We further propose a fault localization technique capable of identifying words that are not robust to these varying environments. Both components of AequeVox are able to operate in the absence of ground truth data. We evaluated AequeVox on speech from four different datasets using three different commercial ASRs. Our experiments reveal that non-native English, female and Nigerian English speakers generate 109%, 528.5% and 156.9% more errors, on average than native English, male and UK Midlands speakers, respectively. Our user study also reveals that 82.9% of the simulations (employed through speech transformations) had a comprehensibility rating above seven (out of ten), with the lowest rating being 6.78. This further validates the fairness violations discovered by AequeVox. Finally, we show that the non-robust words, as predicted by the fault localization technique embodied in AequeVox, show 223.8% more errors than the predicted robust words across all ASRs.
    How Tight Can PAC-Bayes be in the Small Data Regime?. (arXiv:2106.03542v4 [stat.ML] UPDATED)
    (2 min) In this paper, we investigate the question: Given a small number of datapoints, for example N = 30, how tight can PAC-Bayes and test set bounds be made? For such small datasets, test set bounds adversely affect generalisation performance by withholding data from the training procedure. In this setting, PAC-Bayes bounds are especially attractive, due to their ability to use all the data to simultaneously learn a posterior and bound its generalisation risk. We focus on the case of i.i.d. data with a bounded loss and consider the generic PAC-Bayes theorem of Germain et al. While their theorem is known to recover many existing PAC-Bayes bounds, it is unclear what the tightest bound derivable from their framework is. For a fixed learning algorithm and dataset, we show that the tightest possible bound coincides with a bound considered by Catoni; and, in the more natural case of distributions over datasets, we establish a lower bound on the best bound achievable in expectation. Interestingly, this lower bound recovers the Chernoff test set bound if the posterior is equal to the prior. Moreover, to illustrate how tight these bounds can be, we study synthetic one-dimensional classification tasks in which it is feasible to meta-learn both the prior and the form of the bound to numerically optimise for the tightest bounds possible. We find that in this simple, controlled scenario, PAC-Bayes bounds are competitive with comparable, commonly used Chernoff test set bounds. However, the sharpest test set bounds still lead to better guarantees on the generalisation error than the PAC-Bayes bounds we consider.
    GradMax: Growing Neural Networks using Gradient Information. (arXiv:2201.05125v1 [cs.LG])
    (2 min) The architecture and the parameters of neural networks are often optimized independently, which requires costly retraining of the parameters whenever the architecture is modified. In this work we instead focus on growing the architecture without requiring costly retraining. We present a method that adds new neurons during training without impacting what is already learned, while improving the training dynamics. We achieve the latter by maximizing the gradients of the new weights and find the optimal initialization efficiently by means of the singular value decomposition (SVD). We call this technique Gradient Maximizing Growth (GradMax) and demonstrate its effectiveness in variety of vision tasks and architectures.
    Distributed Cooperative Multi-Agent Reinforcement Learning with Directed Coordination Graph. (arXiv:2201.04962v1 [cs.MA])
    (2 min) Existing distributed cooperative multi-agent reinforcement learning (MARL) frameworks usually assume undirected coordination graphs and communication graphs while estimating a global reward via consensus algorithms for policy evaluation. Such a framework may induce expensive communication costs and exhibit poor scalability due to requirement of global consensus. In this work, we study MARLs with directed coordination graphs, and propose a distributed RL algorithm where the local policy evaluations are based on local value functions. The local value function of each agent is obtained by local communication with its neighbors through a directed learning-induced communication graph, without using any consensus algorithm. A zeroth-order optimization (ZOO) approach based on parameter perturbation is employed to achieve gradient estimation. By comparing with existing ZOO-based RL algorithms, we show that our proposed distributed RL algorithm guarantees high scalability. A distributed resource allocation example is shown to illustrate the effectiveness of our algorithm.
    Continuous Herded Gibbs Sampling. (arXiv:2106.06430v2 [stat.ML] UPDATED)
    (2 min) Herding is a technique to sequentially generate deterministic samples from a probability distribution. In this work, we propose a continuous herded Gibbs sampler that combines kernel herding on continuous densities with the Gibbs sampling idea. Our algorithm allows for deterministically sampling from high-dimensional multivariate probability densities, without directly sampling from the joint density. Experiments with Gaussian mixture densities indicate that the L2 error decreases similarly to kernel herding, while the computation time is significantly lower, i.e., linear in the number of dimensions.
    Discrete-time Contraction-based Control of Nonlinear Systems with Parametric Uncertainties using Neural Networks. (arXiv:2105.05432v2 [eess.SY] UPDATED)
    (2 min) In response to the continuously changing feedstock supply and market demand for products with different specifications, the processes need to be operated at time-varying operating conditions and targets (e.g., setpoints) to improve the process economy, in contrast to traditional process operations around predetermined equilibriums. In this paper, a contraction theory-based control approach using neural networks is developed for nonlinear chemical processes to achieve time-varying reference tracking. This approach leverages the universal approximation characteristics of neural networks with discrete-time contraction analysis and control. It involves training a neural network to learn a contraction metric and differential feedback gain, that is embedded in a contraction-based controller. A second, separate neural network is also incorporated into the control-loop to perform online learning of uncertain system model parameters. The resulting control scheme is capable of achieving efficient offset-free tracking of time-varying references, with a full range of model uncertainty, without the need for controller structure redesign as the reference changes. This is a robust approach that can deal with bounded parametric uncertainties in the process model, which are commonly encountered in industrial (chemical) processes. This approach also ensures the process stability during online simultaneous learning and control. Simulation examples are provided to illustrate the above approach.
    Planning in Observable POMDPs in Quasipolynomial Time. (arXiv:2201.04735v1 [cs.LG])
    (2 min) Partially Observable Markov Decision Processes (POMDPs) are a natural and general model in reinforcement learning that take into account the agent's uncertainty about its current state. In the literature on POMDPs, it is customary to assume access to a planning oracle that computes an optimal policy when the parameters are known, even though the problem is known to be computationally hard. Almost all existing planning algorithms either run in exponential time, lack provable performance guarantees, or require placing strong assumptions on the transition dynamics under every possible policy. In this work, we revisit the planning problem and ask: are there natural and well-motivated assumptions that make planning easy? Our main result is a quasipolynomial-time algorithm for planning in (one-step) observable POMDPs. Specifically, we assume that well-separated distributions on states lead to well-separated distributions on observations, and thus the observations are at least somewhat informative in each step. Crucially, this assumption places no restrictions on the transition dynamics of the POMDP; nevertheless, it implies that near-optimal policies admit quasi-succinct descriptions, which is not true in general (under standard hardness assumptions). Our analysis is based on new quantitative bounds for filter stability -- i.e. the rate at which an optimal filter for the latent state forgets its initialization. Furthermore, we prove matching hardness for planning in observable POMDPs under the Exponential Time Hypothesis.
    Low-Dimensional Structure in the Space of Language Representations is Reflected in Brain Responses. (arXiv:2106.05426v4 [cs.CL] UPDATED)
    (0 min) How related are the representations learned by neural language models, translation models, and language tagging tasks? We answer this question by adapting an encoder-decoder transfer learning method from computer vision to investigate the structure among 100 different feature spaces extracted from hidden representations of various networks trained on language tasks. This method reveals a low-dimensional structure where language models and translation models smoothly interpolate between word embeddings, syntactic and semantic tasks, and future word embeddings. We call this low-dimensional structure a language representation embedding because it encodes the relationships between representations needed to process language for a variety of NLP tasks. We find that this representation embedding can predict how well each individual feature space maps to human brain responses to natural language stimuli recorded using fMRI. Additionally, we find that the principal dimension of this structure can be used to create a metric which highlights the brain's natural language processing hierarchy. This suggests that the embedding captures some part of the brain's natural language representation structure.
    Partial Recovery in the Graph Alignment Problem. (arXiv:2007.00533v5 [stat.ML] UPDATED)
    (0 min) In this paper, we consider the graph alignment problem, which is the problem of recovering, given two graphs, a one-to-one mapping between nodes that maximizes edge overlap. This problem can be viewed as a noisy version of the well-known graph isomorphism problem and appears in many applications, including social network deanonymization and cellular biology. Our focus here is on partial recovery, i.e., we look for a one-to-one mapping which is correct on a fraction of the nodes of the graph rather than on all of them, and we assume that the two input graphs to the problem are correlated Erd\H{o}s-R\'enyi graphs of parameters $(n,q,s)$. Our main contribution is then to give necessary and sufficient conditions on $(n,q,s)$ under which partial recovery is possible with high probability as the number of nodes $n$ goes to infinity. In particular, we show that it is possible to achieve partial recovery in the $nqs=\Theta(1)$ regime under certain additional assumptions.
    Flood Prediction and Analysis on the Relevance of Features using Explainable Artificial Intelligence. (arXiv:2201.05046v1 [cs.LG])
    (0 min) This paper presents flood prediction models for the state of Kerala in India by analyzing the monthly rainfall data and applying machine learning algorithms including Logistic Regression, K-Nearest Neighbors, Decision Trees, Random Forests, and Support Vector Machine. Although these models have shown high accuracy prediction of the occurrence of flood in a particular year, they do not quantitatively and qualitatively explain the prediction decision. This paper shows how the background features are learned that contributed to the prediction decision and further extended to explain the inner workings with the development of explainable artificial intelligence modules. The obtained results have confirmed the validity of the findings uncovered by the explainer modules basing on the historical flood monthly rainfall data in Kerala.
    An Overview of Uncertainty Quantification Methods for Infinite Neural Networks. (arXiv:2201.04746v1 [cs.LG])
    (0 min) To better understand the theoretical behavior of large neural networks, several works have analyzed the case where a network's width tends to infinity. In this regime, the effect of random initialization and the process of training a neural network can be formally expressed with analytical tools like Gaussian processes and neural tangent kernels. In this paper, we review methods for quantifying uncertainty in such infinite-width neural networks and compare their relationship to Gaussian processes in the Bayesian inference framework. We make use of several equivalence results along the way to obtain exact closed-form solutions for predictive uncertainty.
    Class-Balanced Distillation for Long-Tailed Visual Recognition. (arXiv:2104.05279v2 [cs.CV] UPDATED)
    (0 min) Real-world imagery is often characterized by a significant imbalance of the number of images per class, leading to long-tailed distributions. An effective and simple approach to long-tailed visual recognition is to learn feature representations and a classifier separately, with instance and class-balanced sampling, respectively. In this work, we introduce a new framework, by making the key observation that a feature representation learned with instance sampling is far from optimal in a long-tailed setting. Our main contribution is a new training method, referred to as Class-Balanced Distillation (CBD), that leverages knowledge distillation to enhance feature representations. CBD allows the feature representation to evolve in the second training stage, guided by the teacher learned in the first stage. The second stage uses class-balanced sampling, in order to focus on under-represented classes. This framework can naturally accommodate the usage of multiple teachers, unlocking the information from an ensemble of models to enhance recognition capabilities. Our experiments show that the proposed technique consistently outperforms the state of the art on long-tailed recognition benchmarks such as ImageNet-LT, iNaturalist17 and iNaturalist18.
    On Stochastic Moving-Average Estimators for Non-Convex Optimization. (arXiv:2104.14840v3 [math.OC] UPDATED)
    (0 min) In this paper, we consider the widely used but not fully understood stochastic estimator based on moving average (SEMA), which only requires {\bf a general unbiased stochastic oracle}. We demonstrate the power of SEMA on a range of stochastic non-convex optimization problems. In particular, we analyze various stochastic methods (existing or newly proposed) based on the {\bf variance recursion property} of SEMA for three families of non-convex optimization, namely standard stochastic non-convex minimization, stochastic non-convex strongly-concave min-max optimization, and stochastic bilevel optimization. Our contributions include: (i) for standard stochastic non-convex minimization, we present a simple and intuitive proof of convergence for a family of Adam-style methods (including Adam, AMSGrad, AdaBound, etc.) with an increasing or large "momentum" parameter for the first-order moment, which gives an alternative yet more natural way to guarantee Adam converge; (ii) for stochastic non-convex strongly-concave min-max optimization, we present a single-loop primal-dual stochastic momentum and adaptive methods based on the moving average estimators and establish its oracle complexity of $O(1/\epsilon^4)$ without using a large mini-batch size, addressing a gap in the literature; (iii) for stochastic bilevel optimization, we present a single-loop stochastic method based on the moving average estimators and establish its oracle complexity of $\widetilde O(1/\epsilon^4)$ without computing the SVD of the Hessian matrix, improving state-of-the-art results. For all these problems, we also establish a variance diminishing result for the used stochastic gradient estimators.
    Hyperparameter Importance for Machine Learning Algorithms. (arXiv:2201.05132v1 [stat.ML])
    (0 min) Hyperparameter plays an essential role in the fitting of supervised machine learning algorithms. However, it is computationally expensive to tune all the tunable hyperparameters simultaneously especially for large data sets. In this paper, we give a definition of hyperparameter importance that can be estimated by subsampling procedures. According to the importance, hyperparameters can then be tuned on the entire data set more efficiently. We show theoretically that the proposed importance on subsets of data is consistent with the one on the population data under weak conditions. Numerical experiments show that the proposed importance is consistent and can save a lot of computational resources.
    Fractal Gaussian Networks: A sparse random graph model based on Gaussian Multiplicative Chaos. (arXiv:2008.03038v2 [stat.ML] UPDATED)
    (0 min) We propose a novel stochastic network model, called Fractal Gaussian Network (FGN), that embodies well-defined and analytically tractable fractal structures. Such fractal structures have been empirically observed in diverse applications. FGNs interpolate continuously between the popular purely random geometric graphs (a.k.a. the Poisson Boolean network), and random graphs with increasingly fractal behavior. In fact, they form a parametric family of sparse random geometric graphs that are parametrized by a fractality parameter which governs the strength of the fractal structure. FGNs are driven by the latent spatial geometry of Gaussian Multiplicative Chaos (GMC), a canonical model of fractality in its own right. We asymptotically characterize the expected number of edges, triangles, cliques and hub-and-spoke motifs in FGNs, unveiling a distinct pattern in their scaling with the size parameter of the network. We then examine the natural question of detecting the presence of fractality and the problem of parameter estimation based on observed network data, in addition to fundamental properties of the FGN as a random graph model. We also explore fractality in community structures by unveiling a natural stochastic block model in the setting of FGNs. Finally, we substantiate our results with phenomenological analysis of the FGN in the context of available scientific literature for fractality in networks, including applications to real-world massive network data.
    Deploying clinical machine learning? Consider the following.... (arXiv:2109.06919v2 [cs.LG] UPDATED)
    (0 min) Despite the intense attention and considerable investment into clinical machine learning research, relatively few applications have been deployed at a large-scale in a real-world clinical environment. While research is important in advancing the state-of-the-art, translation is equally important in bringing these techniques and technologies into a position to ultimately impact healthcare. We believe a lack of appreciation for several considerations are a major cause for this discrepancy between expectation and reality. To better characterize a holistic perspective among researchers and practitioners, we survey several practitioners with commercial experience in developing CML for clinical deployment. Using these insights, we identify several main categories of challenges in order to better design and develop clinical machine learning applications.
    Transferable Time-Series Forecasting under Causal Conditional Shift. (arXiv:2111.03422v2 [cs.LG] UPDATED)
    (0 min) This paper focuses on the problem of semi-supervised domain adaptation for time-series forecasting, which is underexplored in literatures, despite being often encountered in practice. Existing methods on time-series domain adaptation mainly follow the paradigm designed for the static data, which cannot handle domain-specific complex conditional dependencies raised by data offset, time lags, and variant data distributions. In order to address these challenges, we analyze variational conditional dependencies in time-series data and find that the causal structures are usually stable among domains, and further raise the causal conditional shift assumption. Enlightened by this assumption, we consider the causal generation process for time-series data and propose an end-to-end model for the semi-supervised domain adaptation problem on time-series forecasting. Our method can not only discover the Granger-Causal structures among cross-domain data but also address the cross-domain time-series forecasting problem with accurate and interpretable predicted results. We further theoretically analyze the superiority of the proposed method, where the generalization error on the target domain is bounded by the empirical risks and by the discrepancy between the causal structures from different domains. Experimental results on both synthetic and real data demonstrate the effectiveness of our method for the semi-supervised domain adaptation method on time-series forecasting.
    Rate Distortion Characteristic Modeling for Neural Image Compression. (arXiv:2106.12954v2 [eess.IV] UPDATED)
    (0 min) End-to-end optimized neural image compression (NIC) has obtained superior lossy compression performance recently. In this paper, we consider the problem of rate-distortion (R-D) characteristic analysis and modeling for NIC. We make efforts to formulate the essential mathematical functions to describe the R-D behavior of NIC using deep networks. Thus arbitrary bit-rate points could be elegantly realized by leveraging such model via a single trained network. We propose a plugin-in module to learn the relationship between the target bit-rate and the binary representation for the latent variable of auto-encoder. The proposed scheme resolves the problem of training distinct models to reach different points in the R-D space. Furthermore, we model the rate and distortion characteristic of NIC as a function of the coding parameter $\lambda$ respectively. Our experiments show our proposed method is easy to adopt and realizes state-of-the-art continuous bit-rate coding performance, which implies that our approach would benefit the practical deployment of NIC.
    REST: Debiased Social Recommendation via Reconstructing Exposure Strategies. (arXiv:2201.04952v1 [cs.LG])
    (0 min) The recommendation system, relying on historical observational data to model the complex relationships among the users and items, has achieved great success in real-world applications. Selection bias is one of the most important issues of the existing observational data based approaches, which is actually caused by multiple types of unobserved exposure strategies (e.g. promotions and holiday effects). Though various methods have been proposed to address this problem, they are mainly relying on the implicit debiasing techniques but not explicitly modeling the unobserved exposure strategies. By explicitly Reconstructing Exposure STrategies (REST in short), we formalize the recommendation problem as the counterfactual reasoning and propose the debiased social recommendation method. In REST, we assume that the exposure of an item is controlled by the latent exposure strategies, the user, and the item. Based on the above generation process, we first provide the theoretical guarantee of our method via identification analysis. Second, we employ a variational auto-encoder to reconstruct the latent exposure strategies, with the help of the social networks and the items. Third, we devise a counterfactual reasoning based recommendation algorithm by leveraging the recovered exposure strategies. Experiments on four real-world datasets, including three published datasets and one private WeChat Official Account dataset, demonstrate significant improvements over several state-of-the-art methods.
    Unifying Epidemic Models with Mixtures. (arXiv:2201.04960v1 [stat.ML])
    (0 min) The COVID-19 pandemic has emphasized the need for a robust understanding of epidemic models. Current models of epidemics are classified as either mechanistic or non-mechanistic: mechanistic models make explicit assumptions on the dynamics of disease, whereas non-mechanistic models make assumptions on the form of observed time series. Here, we introduce a simple mixture-based model which bridges the two approaches while retaining benefits of both. The model represents time series of cases and fatalities as a mixture of Gaussian curves, providing a flexible function class to learn from data compared to traditional mechanistic models. Although the model is non-mechanistic, we show that it arises as the natural outcome of a stochastic process based on a networked SIR framework. This allows learned parameters to take on a more meaningful interpretation compared to similar non-mechanistic models, and we validate the interpretations using auxiliary mobility data collected during the COVID-19 pandemic. We provide a simple learning algorithm to identify model parameters and establish theoretical results which show the model can be efficiently learned from data. Empirically, we find the model to have low prediction error. The model is available live at covidpredictions.mit.edu. Ultimately, this allows us to systematically understand the impacts of interventions on COVID-19, which is critical in developing data-driven solutions to controlling epidemics.
    Adherence Forecasting for Guided Internet-Delivered Cognitive Behavioral Therapy: A Minimally Data-Sensitive Approach. (arXiv:2201.04967v1 [cs.LG])
    (0 min) Internet-delivered psychological treatments (IDPT) are seen as an effective and scalable pathway to improving the accessibility of mental healthcare. Within this context, treatment adherence is an especially relevant challenge to address due to the reduced interaction between healthcare professionals and patients, compared to more traditional interventions. In parallel, there are increasing regulations when using peoples' personal data, especially in the digital sphere. In such regulations, data minimization is often a core tenant such as within the General Data Protection Regulation (GDPR). Consequently, this work proposes a deep-learning approach to perform automatic adherence forecasting, while only relying on minimally sensitive login/logout data. This approach was tested on a dataset containing 342 patients undergoing guided internet-delivered cognitive behavioral therapy (G-ICBT) treatment. The proposed Self-Attention Network achieved over 70% average balanced accuracy, when only 1/3 of the treatment duration had elapsed. As such, this study demonstrates that automatic adherence forecasting for G-ICBT, is achievable using only minimally sensitive data, thus facilitating the implementation of such tools within real-world IDPT platforms.
    Deep Learning on Multimodal Sensor Data at the Wireless Edge for Vehicular Network. (arXiv:2201.04712v1 [cs.LG])
    (0 min) Beam selection for millimeter-wave links in a vehicular scenario is a challenging problem, as an exhaustive search among all candidate beam pairs cannot be assuredly completed within short contact times. We solve this problem via a novel expediting beam selection by leveraging multimodal data collected from sensors like LiDAR, camera images, and GPS. We propose individual modality and distributed fusion-based deep learning (F-DL) architectures that can execute locally as well as at a mobile edge computing center (MEC), with a study on associated tradeoffs. We also formulate and solve an optimization problem that considers practical beam-searching, MEC processing and sensor-to-MEC data delivery latency overheads for determining the output dimensions of the above F-DL architectures. Results from extensive evaluations conducted on publicly available synthetic and home-grown real-world datasets reveal 95% and 96% improvement in beam selection speed over classical RF-only beam sweeping, respectively. F-DL also outperforms the state-of-the-art techniques by 20-22% in predicting top-10 best beam pairs.
    Using Computational Intelligence for solving the Ornstein-Zernike equation. (arXiv:2201.05089v1 [cond-mat.soft])
    (0 min) The main goal of this thesis is to provide an exploration of the use of computational intelligence techniques to study the numerical solution of the Ornstein-Zernike equation for simple liquids. In particular, a continuous model of the hard sphere fluid is studied. There are two main proposals in this contribution. First, the use of neural networks as a way to parametrize closure relation when solving the Ornstein-Zernike equation. It is explicitly shown that in the case of the hard sphere fluid, the neural network approach seems to reduce to the so-called Hypernetted Chain closure. For the second proposal, we explore the fact that if more physical information is incorporated into the theoretical formalism, a better estimate can be obtained with the use of evolutionary optimization techniques. When choosing the modified Verlet closure relation, and leaving a couple of free parameters to be adjusted, the results are as good as those obtained from molecular simulations. The thesis is then closed with a brief summary of the main findings and outlooks on different ways to improve the proposals presented here.
    A Method for Estimating the Entropy of Time Series Using Artificial Neural Networks. (arXiv:2107.08399v4 [cs.LG] UPDATED)
    (0 min) Measuring the predictability and complexity of time series using entropy is essential tool de-signing and controlling a nonlinear system. However, the existing methods have some drawbacks related to the strong dependence of entropy on the parameters of the methods. To overcome these difficulties, this study proposes a new method for estimating the entropy of a time series using the LogNNet neural network model. The LogNNet reservoir matrix is filled with time series elements according to our algorithm. The accuracy of the classification of images from the MNIST-10 database is considered as the entropy measure and denoted by NNetEn. The novelty of entropy calculation is that the time series is involved in mixing the input information in the res-ervoir. Greater complexity in the time series leads to a higher classification accuracy and higher NNetEn values. We introduce a new time series characteristic called time series learning inertia that determines the learning rate of the neural network. The robustness and efficiency of the method is verified on chaotic, periodic, random, binary, and constant time series. The comparison of NNetEn with other methods of entropy estimation demonstrates that our method is more robust and accurate and can be widely used in practice.
    Data-Driven Modeling and Prediction of Non-Linearizable Dynamics via Spectral Submanifolds. (arXiv:2201.04976v1 [math.DS])
    (0 min) We develop a methodology to construct low-dimensional predictive models from data sets representing essentially nonlinear (or non-linearizable) dynamical systems with a hyperbolic linear part that are subject to external forcing with finitely many frequencies. Our data-driven, sparse, nonlinear models are obtained as extended normal forms of the reduced dynamics on low-dimensional, attracting spectral submanifolds (SSMs) of the dynamical system. We illustrate the power of data-driven SSM reduction on high-dimensional numerical data sets and experimental measurements involving beam oscillations, vortex shedding and sloshing in a water tank. We find that SSM reduction trained on unforced data also predicts nonlinear response accurately under additional external forcing.
    Sparse Spiking Gradient Descent. (arXiv:2105.08810v2 [cs.NE] UPDATED)
    (0 min) There is an increasing interest in emulating Spiking Neural Networks (SNNs) on neuromorphic computing devices due to their low energy consumption. Recent advances have allowed training SNNs to a point where they start to compete with traditional Artificial Neural Networks (ANNs) in terms of accuracy, while at the same time being energy efficient when run on neuromorphic hardware. However, the process of training SNNs is still based on dense tensor operations originally developed for ANNs which do not leverage the spatiotemporally sparse nature of SNNs. We present here the first sparse SNN backpropagation algorithm which achieves the same or better accuracy as current state of the art methods while being significantly faster and more memory efficient. We show the effectiveness of our method on real datasets of varying complexity (Fashion-MNIST, Neuromophic-MNIST and Spiking Heidelberg Digits) achieving a speedup in the backward pass of up to 150x, and 85% more memory efficient, without losing accuracy.
    A robust kernel machine regression towards biomarker selection in multi-omics datasets of osteoporosis for drug discovery. (arXiv:2201.05060v1 [stat.ML])
    (0 min) Many statistical machine approaches could ultimately highlight novel features of the etiology of complex diseases by analyzing multi-omics data. However, they are sensitive to some deviations in distribution when the observed samples are potentially contaminated with adversarial corrupted outliers (e.g., a fictional data distribution). Likewise, statistical advances lag in supporting comprehensive data-driven analyses of complex multi-omics data integration. We propose a novel non-linear M-estimator-based approach, "robust kernel machine regression (RobKMR)," to improve the robustness of statistical machine regression and the diversity of fictional data to examine the higher-order composite effect of multi-omics datasets. We address a robust kernel-centered Gram matrix to estimate the model parameters accurately. We also propose a robust score test to assess the marginal and joint Hadamard product of features from multi-omics data. We apply our proposed approach to a multi-omics dataset of osteoporosis (OP) from Caucasian females. Experiments demonstrate that the proposed approach effectively identifies the inter-related risk factors of OP. With solid evidence (p-value = 0.00001), biological validations, network-based analysis, causal inference, and drug repurposing, the selected three triplets ((DKK1, SMTN, DRGX), (MTND5, FASTKD2, CSMD3), (MTND5, COG3, CSMD3)) are significant biomarkers and directly relate to BMD. Overall, the top three selected genes (DKK1, MTND5, FASTKD2) and one gene (SIDT1 at p-value= 0.001) significantly bond with four drugs- Tacrolimus, Ibandronate, Alendronate, and Bazedoxifene out of 30 candidates for drug repurposing in OP. Further, the proposed approach can be applied to any disease model where multi-omics datasets are available.
    Weakly Supervised Scene Text Detection using Deep Reinforcement Learning. (arXiv:2201.04866v1 [cs.CV])
    (0 min) The challenging field of scene text detection requires complex data annotation, which is time-consuming and expensive. Techniques, such as weak supervision, can reduce the amount of data needed. In this paper we propose a weak supervision method for scene text detection, which makes use of reinforcement learning (RL). The reward received by the RL agent is estimated by a neural network, instead of being inferred from ground-truth labels. First, we enhance an existing supervised RL approach to text detection with several training optimizations, allowing us to close the performance gap to regression-based algorithms. We then use our proposed system in a weakly- and semi-supervised training on real-world data. Our results show that training in a weakly supervised setting is feasible. However, we find that using our model in a semi-supervised setting , e.g. when combining labeled synthetic data with unannotated real-world data, produces the best results.
    Safe Policies for Reinforcement Learning via Primal-Dual Methods. (arXiv:1911.09101v2 [eess.SY] UPDATED)
    (0 min) In this paper, we study the learning of safe policies in the setting of reinforcement learning problems. This is, we aim to control a Markov Decision Process (MDP) of which we do not know the transition probabilities, but we have access to sample trajectories through experience. We define safety as the agent remaining in a desired safe set with high probability during the operation time. We therefore consider a constrained MDP where the constraints are probabilistic. Since there is no straightforward way to optimize the policy with respect to the probabilistic constraint in a reinforcement learning framework, we propose an ergodic relaxation of the problem. The advantages of the proposed relaxation are threefold. (i) The safety guarantees are maintained in the case of episodic tasks and they are kept up to a given time horizon for continuing tasks. (ii) The constrained optimization problem despite its non-convexity has arbitrarily small duality gap if the parametrization of the policy is rich enough. (iii) The gradients of the Lagrangian associated with the safe-learning problem can be easily computed using standard policy gradient results and stochastic approximation tools. Leveraging these advantages, we establish that primal-dual algorithms are able to find policies that are safe and optimal. We test the proposed approach in a navigation task in a continuous domain. The numerical results show that our algorithm is capable of dynamically adapting the policy to the environment and the required safety levels.
    A Geometric Approach to $k$-means. (arXiv:2201.04822v1 [cs.LG])
    (0 min) $k$-means clustering is a fundamental problem in various disciplines. This problem is nonconvex, and standard algorithms are only guaranteed to find a local optimum. Leveraging the structure of local solutions characterized in [1], we propose a general algorithmic framework for escaping undesirable local solutions and recovering the global solution (or the ground truth). This framework consists of alternating between the following two steps iteratively: (i) detect mis-specified clusters in a local solution and (ii) improve the current local solution by non-local operations. We discuss implementation of these steps, and elucidate how the proposed framework unifies variants of $k$-means algorithm in literature from a geometric perspective. In addition, we introduce two natural extensions of the proposed framework, where the initial number of clusters is misspecified. We provide theoretical justification for our approach, which is corroborated with extensive experiments.
    The Recurrent Reinforcement Learning Crypto Agent. (arXiv:2201.04699v1 [cs.LG])
    (0 min) We demonstrate an application of online transfer learning as a digital assets trading agent. This agent makes use of a powerful feature space representation in the form of an echo state network, the output of which is made available to a direct, recurrent reinforcement learning agent. The agent learns to trade the XBTUSD (Bitcoin versus US dollars) perpetual swap derivatives contract on BitMEX. It learns to trade intraday on five minutely sampled data, avoids excessive over-trading, captures a funding profit and is also able to predict the direction of the market. Overall, our crypto agent realises a total return of 350%, net of transaction costs, over roughly five years, 71% of which is down to funding profit. The annualised information ratio that it achieves is 1.46.
    Local2Global: A distributed approach for scaling representation learning on graphs. (arXiv:2201.04729v1 [cs.LG])
    (0 min) We propose a decentralised "local2global"' approach to graph representation learning, that one can a-priori use to scale any embedding technique. Our local2global approach proceeds by first dividing the input graph into overlapping subgraphs (or "patches") and training local representations for each patch independently. In a second step, we combine the local representations into a globally consistent representation by estimating the set of rigid motions that best align the local representations using information from the patch overlaps, via group synchronization. A key distinguishing feature of local2global relative to existing work is that patches are trained independently without the need for the often costly parameter synchronization during distributed training. This allows local2global to scale to large-scale industrial applications, where the input graph may not even fit into memory and may be stored in a distributed manner. We apply local2global on data sets of different sizes and show that our approach achieves a good trade-off between scale and accuracy on edge reconstruction and semi-supervised classification. We also consider the downstream task of anomaly detection and show how one can use local2global to highlight anomalies in cybersecurity networks.
    Privacy Amplification by Subsampling in Time Domain. (arXiv:2201.04762v1 [cs.CR])
    (0 min) Aggregate time-series data like traffic flow and site occupancy repeatedly sample statistics from a population across time. Such data can be profoundly useful for understanding trends within a given population, but also pose a significant privacy risk, potentially revealing e.g., who spends time where. Producing a private version of a time-series satisfying the standard definition of Differential Privacy (DP) is challenging due to the large influence a single participant can have on the sequence: if an individual can contribute to each time step, the amount of additive noise needed to satisfy privacy increases linearly with the number of time steps sampled. As such, if a signal spans a long duration or is oversampled, an excessive amount of noise must be added, drowning out underlying trends. However, in many applications an individual realistically cannot participate at every time step. When this is the case, we observe that the influence of a single participant (sensitivity) can be reduced by subsampling and/or filtering in time, while still meeting privacy requirements. Using a novel analysis, we show this significant reduction in sensitivity and propose a corresponding class of privacy mechanisms. We demonstrate the utility benefits of these techniques empirically with real-world and synthetic time-series data.
    A Survey on Masked Facial Detection Methods and Datasets for Fighting Against COVID-19. (arXiv:2201.04777v1 [cs.CV])
    (0 min) Coronavirus disease 2019 (COVID-19) continues to pose a great challenge to the world since its outbreak. To fight against the disease, a series of artificial intelligence (AI) techniques are developed and applied to real-world scenarios such as safety monitoring, disease diagnosis, infection risk assessment, lesion segmentation of COVID-19 CT scans,etc. The coronavirus epidemics have forced people wear masks to counteract the transmission of virus, which also brings difficulties to monitor large groups of people wearing masks. In this paper, we primarily focus on the AI techniques of masked facial detection and related datasets. We survey the recent advances, beginning with the descriptions of masked facial detection datasets. Thirteen available datasets are described and discussed in details. Then, the methods are roughly categorized into two classes: conventional methods and neural network-based methods. Conventional methods are usually trained by boosting algorithms with hand-crafted features, which accounts for a small proportion. Neural network-based methods are further classified as three parts according to the number of processing stages. Representative algorithms are described in detail, coupled with some typical techniques that are described briefly. Finally, we summarize the recent benchmarking results, give the discussions on the limitations of datasets and methods, and expand future research directions. To our knowledge, this is the first survey about masked facial detection methods and datasets. Hopefully our survey could provide some help to fight against epidemics.
    Self-semantic contour adaptation for cross modality brain tumor segmentation. (arXiv:2201.05022v1 [cs.CV])
    (0 min) Unsupervised domain adaptation (UDA) between two significantly disparate domains to learn high-level semantic alignment is a crucial yet challenging task.~To this end, in this work, we propose exploiting low-level edge information to facilitate the adaptation as a precursor task, which has a small cross-domain gap, compared with semantic segmentation.~The precise contour then provides spatial information to guide the semantic adaptation. More specifically, we propose a multi-task framework to learn a contouring adaptation network along with a semantic segmentation adaptation network, which takes both magnetic resonance imaging (MRI) slice and its initial edge map as input.~These two networks are jointly trained with source domain labels, and the feature and edge map level adversarial learning is carried out for cross-domain alignment. In addition, self-entropy minimization is incorporated to further enhance segmentation performance. We evaluated our framework on the BraTS2018 database for cross-modality segmentation of brain tumors, showing the validity and superiority of our approach, compared with competing methods.
    CLSRIL-23: Cross Lingual Speech Representations for Indic Languages. (arXiv:2107.07402v2 [cs.CL] UPDATED)
    (0 min) We present a CLSRIL-23, a self supervised learning based audio pre-trained model which learns cross lingual speech representations from raw audio across 23 Indic languages. It is built on top of wav2vec 2.0 which is solved by training a contrastive task over masked latent speech representations and jointly learns the quantization of latents shared across all languages. We compare the language wise loss during pretraining to compare effects of monolingual and multilingual pretraining. Performance on some downstream fine-tuning tasks for speech recognition is also compared and our experiments show that multilingual pretraining outperforms monolingual training, in terms of learning speech representations which encodes phonetic similarity of languages and also in terms of performance on down stream tasks. A decrease of 5% is observed in WER and 9.5% in CER when a multilingual pretrained model is used for finetuning in Hindi. All the code models are also open sourced. CLSRIL-23 is a model trained on $23$ languages and almost 10,000 hours of audio data to facilitate research in speech recognition for Indic languages. We hope that new state of the art systems will be created using the self supervised approach, especially for low resources Indic languages.
    Metric-Free Individual Fairness in Online Learning. (arXiv:2002.05474v5 [cs.LG] UPDATED)
    (0 min) We study an online learning problem subject to the constraint of individual fairness, which requires that similar individuals are treated similarly. Unlike prior work on individual fairness, we do not assume the similarity measure among individuals is known, nor do we assume that such measure takes a certain parametric form. Instead, we leverage the existence of an auditor who detects fairness violations without enunciating the quantitative measure. In each round, the auditor examines the learner's decisions and attempts to identify a pair of individuals that are treated unfairly by the learner. We provide a general reduction framework that reduces online classification in our model to standard online classification, which allows us to leverage existing online learning algorithms to achieve sub-linear regret and number of fairness violations. Surprisingly, in the stochastic setting where the data are drawn independently from a distribution, we are also able to establish PAC-style fairness and accuracy generalization guarantees (Yona and Rothblum [2018]), despite only having access to a very restricted form of fairness feedback. Our fairness generalization bound qualitatively matches the uniform convergence bound of Yona and Rothblum [2018], while also providing a meaningful accuracy generalization guarantee. Our results resolve an open question by Gillen et al. [2018] by showing that online learning under an unknown individual fairness constraint is possible even without assuming a strong parametric form of the underlying similarity measure.
    Pushing the limits of self-supervised ResNets: Can we outperform supervised learning without labels on ImageNet?. (arXiv:2201.05119v1 [cs.CV])
    (2 min) Despite recent progress made by self-supervised methods in representation learning with residual networks, they still underperform supervised learning on the ImageNet classification benchmark, limiting their applicability in performance-critical settings. Building on prior theoretical insights from Mitrovic et al., 2021, we propose ReLICv2 which combines an explicit invariance loss with a contrastive objective over a varied set of appropriately constructed data views. ReLICv2 achieves 77.1% top-1 classification accuracy on ImageNet using linear evaluation with a ResNet50 architecture and 80.6% with larger ResNet models, outperforming previous state-of-the-art self-supervised approaches by a wide margin. Most notably, ReLICv2 is the first representation learning method to consistently outperform the supervised baseline in a like-for-like comparison using a range of standard ResNet architectures. Finally we show that despite using ResNet encoders, ReLICv2 is comparable to state-of-the-art self-supervised vision transformers.
    Adversarially Robust Classification by Conditional Generative Model Inversion. (arXiv:2201.04733v1 [cs.LG])
    (2 min) Most adversarial attack defense methods rely on obfuscating gradients. These methods are successful in defending against gradient-based attacks; however, they are easily circumvented by attacks which either do not use the gradient or by attacks which approximate and use the corrected gradient. Defenses that do not obfuscate gradients such as adversarial training exist, but these approaches generally make assumptions about the attack such as its magnitude. We propose a classification model that does not obfuscate gradients and is robust by construction without assuming prior knowledge about the attack. Our method casts classification as an optimization problem where we "invert" a conditional generator trained on unperturbed, natural images to find the class that generates the closest sample to the query image. We hypothesize that a potential source of brittleness against adversarial attacks is the high-to-low-dimensional nature of feed-forward classifiers which allows an adversary to find small perturbations in the input space that lead to large changes in the output space. On the other hand, a generative model is typically a low-to-high-dimensional mapping. While the method is related to Defense-GAN, the use of a conditional generative model and inversion in our model instead of the feed-forward classifier is a critical difference. Unlike Defense-GAN, which was shown to generate obfuscated gradients that are easily circumvented, we show that our method does not obfuscate gradients. We demonstrate that our model is extremely robust against black-box attacks and has improved robustness against white-box attacks compared to naturally trained, feed-forward classifiers.
    Multi-Scale Adaptive Graph Neural Network for Multivariate Time Series Forecasting. (arXiv:2201.04828v1 [cs.LG])
    (2 min) Multivariate time series (MTS) forecasting plays an important role in the automation and optimization of intelligent applications. It is a challenging task, as we need to consider both complex intra-variable dependencies and inter-variable dependencies. Existing works only learn temporal patterns with the help of single inter-variable dependencies. However, there are multi-scale temporal patterns in many real-world MTS. Single inter-variable dependencies make the model prefer to learn one type of prominent and shared temporal patterns. In this paper, we propose a multi-scale adaptive graph neural network (MAGNN) to address the above issue. MAGNN exploits a multi-scale pyramid network to preserve the underlying temporal dependencies at different time scales. Since the inter-variable dependencies may be different under distinct time scales, an adaptive graph learning module is designed to infer the scale-specific inter-variable dependencies without pre-defined priors. Given the multi-scale feature representations and scale-specific inter-variable dependencies, a multi-scale temporal graph neural network is introduced to jointly model intra-variable dependencies and inter-variable dependencies. After that, we develop a scale-wise fusion module to effectively promote the collaboration across different time scales, and automatically capture the importance of contributed temporal patterns. Experiments on four real-world datasets demonstrate that MAGNN outperforms the state-of-the-art methods across various settings.
    Toddler-Guidance Learning: Impacts of Critical Period on Multimodal AI Agents. (arXiv:2201.04990v1 [cs.LG])
    (2 min) Critical periods are phases during which a toddler's brain develops in spurts. To promote children's cognitive development, proper guidance is critical in this stage. However, it is not clear whether such a critical period also exists for the training of AI agents. Similar to human toddlers, well-timed guidance and multimodal interactions might significantly enhance the training efficiency of AI agents as well. To validate this hypothesis, we adapt this notion of critical periods to learning in AI agents and investigate the critical period in the virtual environment for AI agents. We formalize the critical period and Toddler-guidance learning in the reinforcement learning (RL) framework. Then, we built up a toddler-like environment with VECA toolkit to mimic human toddlers' learning characteristics. We study three discrete levels of mutual interaction: weak-mentor guidance (sparse reward), moderate mentor guidance (helper-reward), and mentor demonstration (behavioral cloning). We also introduce the EAVE dataset consisting of 30,000 real-world images to fully reflect the toddler's viewpoint. We evaluate the impact of critical periods on AI agents from two perspectives: how and when they are guided best in both uni- and multimodal learning. Our experimental results show that both uni- and multimodal agents with moderate mentor guidance and critical period on 1 million and 2 million training steps show a noticeable improvement. We validate these results with transfer learning on the EAVE dataset and find the performance advancement on the same critical period and the guidance.
    Solving Dynamic Graph Problems with Multi-Attention Deep Reinforcement Learning. (arXiv:2201.04895v1 [cs.LG])
    (2 min) Graph problems such as traveling salesman problem, or finding minimal Steiner trees are widely studied and used in data engineering and computer science. Typically, in real-world applications, the features of the graph tend to change over time, thus, finding a solution to the problem becomes challenging. The dynamic version of many graph problems are the key for a plethora of real-world problems in transportation, telecommunication, and social networks. In recent years, using deep learning techniques to find heuristic solutions for NP-hard graph combinatorial problems has gained much interest as these learned heuristics can find near-optimal solutions efficiently. However, most of the existing methods for learning heuristics focus on static graph problems. The dynamic nature makes NP-hard graph problems much more challenging to learn, and the existing methods fail to find reasonable solutions. In this paper, we propose a novel architecture named Graph Temporal Attention with Reinforcement Learning (GTA-RL) to learn heuristic solutions for graph-based dynamic combinatorial optimization problems. The GTA-RL architecture consists of an encoder capable of embedding temporal features of a combinatorial problem instance and a decoder capable of dynamically focusing on the embedded features to find a solution to a given combinatorial problem instance. We then extend our architecture to learn heuristics for the real-time version of combinatorial optimization problems where all input features of a problem are not known a prior, but rather learned in real-time. Our experimental results against several state-of-the-art learning-based algorithms and optimal solvers demonstrate that our approach outperforms the state-of-the-art learning-based approaches in terms of effectiveness and optimal solvers in terms of efficiency on dynamic and real-time graph combinatorial optimization.
    Automated Reinforcement Learning: An Overview. (arXiv:2201.05000v1 [cs.LG])
    (2 min) Reinforcement Learning and recently Deep Reinforcement Learning are popular methods for solving sequential decision making problems modeled as Markov Decision Processes. RL modeling of a problem and selecting algorithms and hyper-parameters require careful considerations as different configurations may entail completely different performances. These considerations are mainly the task of RL experts; however, RL is progressively becoming popular in other fields where the researchers and system designers are not RL experts. Besides, many modeling decisions, such as defining state and action space, size of batches and frequency of batch updating, and number of timesteps are typically made manually. For these reasons, automating different components of RL framework is of great importance and it has attracted much attention in recent years. Automated RL provides a framework in which different components of RL including MDP modeling, algorithm selection and hyper-parameter optimization are modeled and defined automatically. In this article, we explore the literature and present recent work that can be used in automated RL. Moreover, we discuss the challenges, open questions and research directions in AutoRL.
    Automatic Sparse Connectivity Learning for Neural Networks. (arXiv:2201.05020v1 [cs.CV])
    (2 min) Since sparse neural networks usually contain many zero weights, these unnecessary network connections can potentially be eliminated without degrading network performance. Therefore, well-designed sparse neural networks have the potential to significantly reduce FLOPs and computational resources. In this work, we propose a new automatic pruning method - Sparse Connectivity Learning (SCL). Specifically, a weight is re-parameterized as an element-wise multiplication of a trainable weight variable and a binary mask. Thus, network connectivity is fully described by the binary mask, which is modulated by a unit step function. We theoretically prove the fundamental principle of using a straight-through estimator (STE) for network pruning. This principle is that the proxy gradients of STE should be positive, ensuring that mask variables converge at their minima. After finding Leaky ReLU, Softplus, and Identity STEs can satisfy this principle, we propose to adopt Identity STE in SCL for discrete mask relaxation. We find that mask gradients of different features are very unbalanced, hence, we propose to normalize mask gradients of each feature to optimize mask variable training. In order to automatically train sparse masks, we include the total number of network connections as a regularization term in our objective function. As SCL does not require pruning criteria or hyper-parameters defined by designers for network layers, the network is explored in a larger hypothesis space to achieve optimized sparse connectivity for the best performance. SCL overcomes the limitations of existing automatic pruning methods. Experimental results demonstrate that SCL can automatically learn and select important network connections for various baseline network structures. Deep learning models trained by SCL outperform the SOTA human-designed and automatic pruning methods in sparsity, accuracy, and FLOPs reduction.
    Multi-Stage Hybrid Federated Learning over Large-Scale D2D-Enabled Fog Networks. (arXiv:2007.09511v5 [cs.NI] UPDATED)
    (3 min) Federated learning has generated significant interest, with nearly all works focused on a "star" topology where nodes/devices are each connected to a central server. We migrate away from this architecture and extend it through the network dimension to the case where there are multiple layers of nodes between the end devices and the server. Specifically, we develop multi-stage hybrid federated learning (MH-FL), a hybrid of intra- and inter-layer model learning that considers the network as a multi-layer cluster-based structure. MH-FL considers the topology structures among the nodes in the clusters, including local networks formed via device-to-device (D2D) communications, and presumes a semi-decentralized architecture for federated learning. It orchestrates the devices at different network layers in a collaborative/cooperative manner (i.e., using D2D interactions) to form local consensus on the model parameters and combines it with multi-stage parameter relaying between layers of the tree-shaped hierarchy. We derive the upper bound of convergence for MH-FL with respect to parameters of the network topology (e.g., the spectral radius) and the learning algorithm (e.g., the number of D2D rounds in different clusters). We obtain a set of policies for the D2D rounds at different clusters to guarantee either a finite optimality gap or convergence to the global optimum. We then develop a distributed control algorithm for MH-FL to tune the D2D rounds in each cluster over time to meet specific convergence criteria. Our experiments on real-world datasets verify our analytical results and demonstrate the advantages of MH-FL in terms of resource utilization metrics.
    Spatiotemporal Clustering with Neyman-Scott Processes via Connections to Bayesian Nonparametric Mixture Models. (arXiv:2201.05044v1 [stat.ML])
    (2 min) Neyman-Scott process (NSP) are point process models that generate clusters of points in time or space. They are natural models for a wide range of phenomena, ranging from neural spike trains to document streams. The clustering property is achieved via a doubly stochastic formulation: first, a set of latent events is drawn from a Poisson process; then, each latent event generates a set of observed data points according to another Poisson process. This construction is similar to Bayesian nonparametric mixture models like the Dirichlet process mixture model (DPMM) in that the number of latent events (i.e. clusters) is a random variable, but the point process formulation makes the NSP especially well suited to modeling spatiotemporal data. While many specialized algorithms have been developed for DPMMs, comparatively fewer works have focused on inference in NSPs. Here, we present novel connections between NSPs and DPMMs, with the key link being a third class of Bayesian mixture models called mixture of finite mixture models (MFMMs). Leveraging this connection, we adapt the standard collapsed Gibbs sampling algorithm for DPMMs to enable scalable Bayesian inference on NSP models. We demonstrate the potential of Neyman-Scott processes on a variety of applications including sequence detection in neural spike trains and event detection in document streams.
    SeamlessGAN: Self-Supervised Synthesis of Tileable Texture Maps. (arXiv:2201.05120v1 [cs.CV])
    (2 min) We present SeamlessGAN, a method capable of automatically generating tileable texture maps from a single input exemplar. In contrast to most existing methods, focused solely on solving the synthesis problem, our work tackles both problems, synthesis and tileability, simultaneously. Our key idea is to realize that tiling a latent space within a generative network trained using adversarial expansion techniques produces outputs with continuity at the seam intersection that can be then be turned into tileable images by cropping the central area. Since not every value of the latent space is valid to produce high-quality outputs, we leverage the discriminator as a perceptual error metric capable of identifying artifact-free textures during a sampling process. Further, in contrast to previous work on deep texture synthesis, our model is designed and optimized to work with multi-layered texture representations, enabling textures composed of multiple maps such as albedo, normals, etc. We extensively test our design choices for the network architecture, loss function and sampling parameters. We show qualitatively and quantitatively that our approach outperforms previous methods and works for textures of different types.
    Evaluation of Four Black-box Adversarial Attacks and Some Query-efficient Improvement Analysis. (arXiv:2201.05001v1 [cs.CR])
    (2 min) With the fast development of machine learning technologies, deep learning models have been deployed in almost every aspect of everyday life. However, the privacy and security of these models are threatened by adversarial attacks. Among which black-box attack is closer to reality, where limited knowledge can be acquired from the model. In this paper, we provided basic background knowledge about adversarial attack and analyzed four black-box attack algorithms: Bandits, NES, Square Attack and ZOsignSGD comprehensively. We also explored the newly proposed Square Attack method with respect to square size, hoping to improve its query efficiency.
    Multi-View Non-negative Matrix Factorization Discriminant Learning via Cross Entropy Loss. (arXiv:2201.04726v1 [cs.LG])
    (2 min) Multi-view learning accomplishes the task objectives of classification by leverag-ing the relationships between different views of the same object. Most existing methods usually focus on consistency and complementarity between multiple views. But not all of this information is useful for classification tasks. Instead, it is the specific discriminating information that plays an important role. Zhong Zhang et al. explore the discriminative and non-discriminative information exist-ing in common and view-specific parts among different views via joint non-negative matrix factorization. In this paper, we improve this algorithm on this ba-sis by using the cross entropy loss function to constrain the objective function better. At last, we implement better classification effect than original on the same data sets and show its superiority over many state-of-the-art algorithms.
    Stock Movement Prediction Based on Bi-typed and Hybrid-relational Market Knowledge Graph via Dual Attention Networks. (arXiv:2201.04965v1 [q-fin.ST])
    (2 min) Stock Movement Prediction (SMP) aims at predicting listed companies' stock future price trend, which is a challenging task due to the volatile nature of financial markets. Recent financial studies show that the momentum spillover effect plays a significant role in stock fluctuation. However, previous studies typically only learn the simple connection information among related companies, which inevitably fail to model complex relations of listed companies in the real financial market. To address this issue, we first construct a more comprehensive Market Knowledge Graph (MKG) which contains bi-typed entities including listed companies and their associated executives, and hybrid-relations including the explicit relations and implicit relations. Afterward, we propose DanSmp, a novel Dual Attention Networks to learn the momentum spillover signals based upon the constructed MKG for stock prediction. The empirical experiments on our constructed datasets against nine SOTA baselines demonstrate that the proposed DanSmp is capable of improving stock prediction with the constructed MKG.
    Non-Stationary Representation Learning in Sequential Linear Bandits. (arXiv:2201.04805v1 [cs.LG])
    (2 min) In this paper, we study representation learning for multi-task decision-making in non-stationary environments. We consider the framework of sequential linear bandits, where the agent performs a series of tasks drawn from distinct sets associated with different environments. The embeddings of tasks in each set share a low-dimensional feature extractor called representation, and representations are different across sets. We propose an online algorithm that facilitates efficient decision-making by learning and transferring non-stationary representations in an adaptive fashion. We prove that our algorithm significantly outperforms the existing ones that treat tasks independently. We also conduct experiments using both synthetic and real data to validate our theoretical insights and demonstrate the efficacy of our algorithm.
    How Can Graph Neural Networks Help Document Retrieval: A Case Study on CORD19 with Concept Map Generation. (arXiv:2201.04672v1 [cs.IR])
    (2 min) Graph neural networks (GNNs), as a group of powerful tools for representation learning on irregular data, have manifested superiority in various downstream tasks. With unstructured texts represented as concept maps, GNNs can be exploited for tasks like document retrieval. Intrigued by how can GNNs help document retrieval, we conduct an empirical study on a large-scale multi-discipline dataset CORD-19. Results show that instead of the complex structure-oriented GNNs such as GINs and GATs, our proposed semantics-oriented graph functions achieve better and more stable performance based on the BM25 retrieved candidates. Our insights in this case study can serve as a guideline for future work to develop effective GNNs with appropriate semantics-oriented inductive biases for textual reasoning tasks like document retrieval and classification. All code for this case study is available at https://github.com/HennyJie/GNN-DocRetrieval.
    When Machine Learning Meets Spectrum Sharing Security: Methodologies and Challenges. (arXiv:2201.04677v1 [cs.CR])
    (2 min) The exponential growth of internet connected systems has generated numerous challenges, such as spectrum shortage issues, which require efficient spectrum sharing (SS) solutions. Complicated and dynamic SS systems can be exposed to different potential security and privacy issues, requiring protection mechanisms to be adaptive, reliable, and scalable. Machine learning (ML) based methods have frequently been proposed to address those issues. In this article, we provide a comprehensive survey of the recent development of ML based SS methods, the most critical security issues, and corresponding defense mechanisms. In particular, we elaborate the state-of-the-art methodologies for improving the performance of SS communication systems for various vital aspects, including ML based cognitive radio networks (CRNs), ML based database assisted SS networks, ML based LTE-U networks, ML based ambient backscatter networks, and other ML based SS solutions. We also present security issues from the physical layer and corresponding defending strategies based on ML algorithms, including Primary User Emulation (PUE) attacks, Spectrum Sensing Data Falsification (SSDF) attacks, jamming attacks, eavesdropping attacks, and privacy issues. Finally, extensive discussions on open challenges for ML based SS are also given. This comprehensive review is intended to provide the foundation for and facilitate future studies on exploring the potential of emerging ML for coping with increasingly complex SS and their security problems.
    Learning Without a Global Clock: Asynchronous Learning in a Physics-Driven Learning Network. (arXiv:2201.04626v1 [cond-mat.soft])
    (2 min) In a neuron network, synapses update individually using local information, allowing for entirely decentralized learning. In contrast, elements in an artificial neural network (ANN) are typically updated simultaneously using a central processor. Here we investigate the feasibility and effect of asynchronous learning in a recently introduced decentralized, physics-driven learning network. We show that desynchronizing the learning process does not degrade performance for a variety of tasks in an idealized simulation. In experiment, desynchronization actually improves performance by allowing the system to better explore the discretized state space of solutions. We draw an analogy between asynchronicity and mini-batching in stochastic gradient descent, and show that they have similar effects on the learning process. Desynchronizing the learning process establishes physics-driven learning networks as truly fully distributed learning machines, promoting better performance and scalability in deployment.
    On Sampling Collaborative Filtering Datasets. (arXiv:2201.04768v1 [cs.LG])
    (2 min) We study the practical consequences of dataset sampling strategies on the ranking performance of recommendation algorithms. Recommender systems are generally trained and evaluated on samples of larger datasets. Samples are often taken in a naive or ad-hoc fashion: e.g. by sampling a dataset randomly or by selecting users or items with many interactions. As we demonstrate, commonly-used data sampling schemes can have significant consequences on algorithm performance. Following this observation, this paper makes three main contributions: (1) characterizing the effect of sampling on algorithm performance, in terms of algorithm and dataset characteristics (e.g. sparsity characteristics, sequential dynamics, etc.); (2) designing SVP-CF, which is a data-specific sampling strategy, that aims to preserve the relative performance of models after sampling, and is especially suited to long-tailed interaction data; and (3) developing an oracle, Data-Genie, which can suggest the sampling scheme that is most likely to preserve model performance for a given dataset. The main benefit of Data-Genie is that it will allow recommender system practitioners to quickly prototype and compare various approaches, while remaining confident that algorithm performance will be preserved, once the algorithm is retrained and deployed on the complete data. Detailed experiments show that using Data-Genie, we can discard upto 5x more data than any sampling strategy with the same level of performance.
    Sparsely Changing Latent States for Prediction and Planning in Partially Observable Domains. (arXiv:2110.15949v2 [cs.LG] UPDATED)
    (2 min) A common approach to prediction and planning in partially observable domains is to use recurrent neural networks (RNNs), which ideally develop and maintain a latent memory about hidden, task-relevant factors. We hypothesize that many of these hidden factors in the physical world are constant over time, changing only sparsely. To study this hypothesis, we propose Gated $L_0$ Regularized Dynamics (GateL0RD), a novel recurrent architecture that incorporates the inductive bias to maintain stable, sparsely changing latent states. The bias is implemented by means of a novel internal gating function and a penalty on the $L_0$ norm of latent state changes. We demonstrate that GateL0RD can compete with or outperform state-of-the-art RNNs in a variety of partially observable prediction and control tasks. GateL0RD tends to encode the underlying generative factors of the environment, ignores spurious temporal dependencies, and generalizes better, improving sampling efficiency and overall performance in model-based planning and reinforcement learning tasks. Moreover, we show that the developing latent states can be easily interpreted, which is a step towards better explainability in RNNs.
    Black-box Safety Analysis and Retraining of DNNs based on Feature Extraction and Clustering. (arXiv:2201.05077v1 [cs.SE])
    (2 min) Deep neural networks (DNNs) have demonstrated superior performance over classical machine learning to support many features in safety-critical systems. Although DNNs are now widely used in such systems (e.g., self driving cars), there is limited progress regarding automated support for functional safety analysis in DNN-based systems. For example, the identification of root causes of errors, to enable both risk analysis and DNN retraining, remains an open problem. In this paper, we propose SAFE, a black-box approach to automatically characterize the root causes of DNN errors. SAFE relies on a transfer learning model pre-trained on ImageNet to extract the features from error-inducing images. It then applies a density-based clustering algorithm to detect arbitrary shaped clusters of images modeling plausible causes of error. Last, clusters are used to effectively retrain and improve the DNN. The black-box nature of SAFE is motivated by our objective not to require changes or even access to the DNN internals to facilitate adoption. Experimental results show the superior ability of SAFE in identifying different root causes of DNN errors based on case studies in the automotive domain. It also yields significant improvements in DNN accuracy after retraining, while saving significant execution time and memory when compared to alternatives.
    Recursive Least Squares Policy Control with Echo State Network. (arXiv:2201.04781v1 [cs.LG])
    (2 min) The echo state network (ESN) is a special type of recurrent neural networks for processing the time-series dataset. However, limited by the strong correlation among sequential samples of the agent, ESN-based policy control algorithms are difficult to use the recursive least squares (RLS) algorithm to update the ESN's parameters. To solve this problem, we propose two novel policy control algorithms, ESNRLS-Q and ESNRLS-Sarsa. Firstly, to reduce the correlation of training samples, we use the leaky integrator ESN and the mini-batch learning mode. Secondly, to make RLS suitable for training ESN in mini-batch mode, we present a new mean-approximation method for updating the RLS correlation matrix. Thirdly, to prevent ESN from over-fitting, we use the L1 regularization technique. Lastly, to prevent the target state-action value from overestimation, we employ the Mellowmax method. Simulation results show that our algorithms have good convergence performance.
    Multi-echelon Supply Chains with Uncertain Seasonal Demands and Lead Times Using Deep Reinforcement Learning. (arXiv:2201.04651v1 [cs.LG])
    (2 min) We address the problem of production planning and distribution in multi-echelon supply chains. We consider uncertain demands and lead times which makes the problem stochastic and non-linear. A Markov Decision Process formulation and a Non-linear Programming model are presented. As a sequential decision-making problem, Deep Reinforcement Learning (RL) is a possible solution approach. This type of technique has gained a lot of attention from Artificial Intelligence and Optimization communities in recent years. Considering the good results obtained with Deep RL approaches in different areas there is a growing interest in applying them in problems from the Operations Research field. We have used a Deep RL technique, namely Proximal Policy Optimization (PPO2), to solve the problem considering uncertain, regular and seasonal demands and constant or stochastic lead times. Experiments are carried out in different scenarios to better assess the suitability of the algorithm. An agent based on a linearized model is used as a baseline. Experimental results indicate that PPO2 is a competitive and adequate tool for this type of problem. PPO2 agent is better than baseline in all scenarios with stochastic lead times (7.3-11.2%), regardless of whether demands are seasonal or not. In scenarios with constant lead times, the PPO2 agent is better when uncertain demands are non-seasonal (2.2-4.7%). The results show that the greater the uncertainty of the scenario, the greater the viability of this type of approach.
    Certifiable Robustness for Nearest Neighbor Classifiers. (arXiv:2201.04770v1 [cs.LG])
    (2 min) ML models are typically trained using large datasets of high quality. However, training datasets often contain inconsistent or incomplete data. To tackle this issue, one solution is to develop algorithms that can check whether a prediction of a model is certifiably robust. Given a learning algorithm that produces a classifier and given an example at test time, a classification outcome is certifiably robust if it is predicted by every model trained across all possible worlds (repairs) of the uncertain (inconsistent) dataset. This notion of robustness falls naturally under the framework of certain answers. In this paper, we study the complexity of certifying robustness for a simple but widely deployed classification algorithm, $k$-Nearest Neighbors ($k$-NN). Our main focus is on inconsistent datasets when the integrity constraints are functional dependencies (FDs). For this setting, we establish a dichotomy in the complexity of certifying robustness w.r.t. the set of FDs: the problem either admits a polynomial time algorithm, or it is coNP-hard. Additionally, we exhibit a similar dichotomy for the counting version of the problem, where the goal is to count the number of possible worlds that predict a certain label. As a byproduct of our study, we also establish the complexity of a problem related to finding an optimal subset repair that may be of independent interest.
    Direct Mutation and Crossover in Genetic Algorithms Applied to Reinforcement Learning Tasks. (arXiv:2201.04815v1 [cs.NE])
    (2 min) Neuroevolution has recently been shown to be quite competitive in reinforcement learning (RL) settings, and is able to alleviate some of the drawbacks of gradient-based approaches. This paper will focus on applying neuroevolution using a simple genetic algorithm (GA) to find the weights of a neural network that produce optimally behaving agents. In addition, we present two novel modifications that improve the data efficiency and speed of convergence when compared to the initial implementation. The modifications are evaluated on the FrozenLake environment provided by OpenAI gym and prove to be significantly better than the baseline approach.
    On the Design of Graph Embeddings for the Sensorless Estimation of Road Traffic Profiles. (arXiv:2201.04968v1 [cs.LG])
    (0 min) Traffic forecasting models rely on data that needs to be sensed, processed, and stored. This requires the deployment and maintenance of traffic sensing infrastructure, often leading to unaffordable monetary costs. The lack of sensed locations can be complemented with synthetic data simulations that further lower the economical investment needed for traffic monitoring. One of the most common data generative approaches consists of producing real-like traffic patterns, according to data distributions from analogous roads. The process of detecting roads with similar traffic is the key point of these systems. However, without collecting data at the target location no flow metrics can be employed for this similarity-based search. We present a method to discover locations among those with available traffic data by inspecting topological features of road segments. Relevant topological features are extracted as numerical representations (embeddings) to compare different locations and eventually find the most similar roads based on the similarity between their embeddings. The performance of this novel selection system is examined and compared to simpler traffic estimation approaches. After finding a similar source of data, a generative method is used to synthesize traffic profiles. Depending on the resemblance of the traffic behavior at the sensed road, the generation method can be fed with data from one road only. Several generation approaches are analyzed in terms of the precision of the synthesized samples. Above all, this work intends to stimulate further research efforts towards enhancing the quality of synthetic traffic samples and thereby, reducing the need for sensing infrastructure.
    Criticality-Based Varying Step-Number Algorithm for Reinforcement Learning. (arXiv:2201.05034v1 [cs.LG])
    (0 min) In the context of reinforcement learning we introduce the concept of criticality of a state, which indicates the extent to which the choice of action in that particular state influences the expected return. That is, a state in which the choice of action is more likely to influence the final outcome is considered as more critical than a state in which it is less likely to influence the final outcome. We formulate a criticality-based varying step number algorithm (CVS) - a flexible step number algorithm that utilizes the criticality function provided by a human, or learned directly from the environment. We test it in three different domains including the Atari Pong environment, Road-Tree environment, and Shooter environment. We demonstrate that CVS is able to outperform popular learning algorithms such as Deep Q-Learning and Monte Carlo.
    Multi-task longitudinal forecasting with missing values on Alzheimer's Disease. (arXiv:2201.05040v1 [stat.ML])
    (0 min) Machine learning techniques typically applied to dementia forecasting lack in their capabilities to jointly learn several tasks, handle time dependent heterogeneous data and missing values. In this paper, we propose a framework using the recently presented SSHIBA model for jointly learning different tasks on longitudinal data with missing values. The method uses Bayesian variational inference to impute missing values and combine information of several views. This way, we can combine different data-views from different time-points in a common latent space and learn the relations between each time-point while simultaneously modelling and predicting several output variables. We apply this model to predict together diagnosis, ventricle volume, and clinical scores in dementia. The results demonstrate that SSHIBA is capable of learning a good imputation of the missing values and outperforming the baselines while simultaneously predicting three different tasks.
    VELVET: a noVel Ensemble Learning approach to automatically locate VulnErable sTatements. (arXiv:2112.10893v2 [cs.SE] UPDATED)
    (0 min) Automatically locating vulnerable statements in source code is crucial to assure software security and alleviate developers' debugging efforts. This becomes even more important in today's software ecosystem, where vulnerable code can flow easily and unwittingly within and across software repositories like GitHub. Across such millions of lines of code, traditional static and dynamic approaches struggle to scale. Although existing machine-learning-based approaches look promising in such a setting, most work detects vulnerable code at a higher granularity -- at the method or file level. Thus, developers still need to inspect a significant amount of code to locate the vulnerable statement(s) that need to be fixed. This paper presents VELVET, a novel ensemble learning approach to locate vulnerable statements. Our model combines graph-based and sequence-based neural networks to successfully capture the local and global context of a program graph and effectively understand code semantics and vulnerable patterns. To study VELVET's effectiveness, we use an off-the-shelf synthetic dataset and a recently published real-world dataset. In the static analysis setting, where vulnerable functions are not detected in advance, VELVET achieves 4.5x better performance than the baseline static analyzers on the real-world data. For the isolated vulnerability localization task, where we assume the vulnerability of a function is known while the specific vulnerable statement is unknown, we compare VELVET with several neural networks that also attend to local and global context of code. VELVET achieves 99.6% and 43.6% top-1 accuracy over synthetic data and real-world data, respectively, outperforming the baseline deep-learning models by 5.3-29.0%.
    Combining Interventional and Observational Data Using Causal Reductions. (arXiv:2103.04786v2 [stat.ML] UPDATED)
    (2 min) Unobserved confounding is one of the main challenges when estimating causal effects. We propose a causal reduction method that, given a causal model, replaces an arbitrary number of possibly high-dimensional latent confounders with a single latent confounder that takes values in the same space as the treatment variable, without changing the observational and interventional distributions the causal model entails. This allows us to estimate the causal effect in a principled way from combined data without relying on the common but often unrealistic assumption that all confounders have been observed. We apply our causal reduction in three different settings. In the first setting, we assume the treatment and outcome to be discrete. The causal reduction then implies bounds between the observational and interventional distributions that can be exploited for estimation purposes. In certain cases with highly unbalanced observational samples, the accuracy of the causal effect estimate can be improved by incorporating observational data. Second, for continuous variables and assuming a linear-Gaussian model, we derive equality constraints for the parameters of the observational and interventional distributions. Third, for the general continuous setting (possibly nonlinear or non-Gaussian), we parameterize the reduced causal model using normalizing flows, a flexible class of easily invertible nonlinear transformations. We perform a series of experiments on synthetic data and find that in several cases the number of interventional samples can be reduced when adding observational training samples without sacrificing accuracy.
    Evaluation of Neural Networks Defenses and Attacks using NDCG and Reciprocal Rank Metrics. (arXiv:2201.05071v1 [cs.CR])
    (2 min) The problem of attacks on neural networks through input modification (i.e., adversarial examples) has attracted much attention recently. Being relatively easy to generate and hard to detect, these attacks pose a security breach that many suggested defenses try to mitigate. However, the evaluation of the effect of attacks and defenses commonly relies on traditional classification metrics, without adequate adaptation to adversarial scenarios. Most of these metrics are accuracy-based, and therefore may have a limited scope and low distinctive power. Other metrics do not consider the unique characteristics of neural networks functionality, or measure the effect of the attacks indirectly (e.g., through the complexity of their generation). In this paper, we present two metrics which are specifically designed to measure the effect of attacks, or the recovery effect of defenses, on the output of neural networks in multiclass classification tasks. Inspired by the normalized discounted cumulative gain and the reciprocal rank metrics used in information retrieval literature, we treat the neural network predictions as ranked lists of results. Using additional information about the probability of the rank enabled us to define novel metrics that are suited to the task at hand. We evaluate our metrics using various attacks and defenses on a pretrained VGG19 model and the ImageNet dataset. Compared to the common classification metrics, our proposed metrics demonstrate superior informativeness and distinctiveness.
    Pointwise Binary Classification with Pairwise Confidence Comparisons. (arXiv:2010.01875v4 [cs.LG] UPDATED)
    (2 min) To alleviate the data requirement for training effective binary classifiers in binary classification, many weakly supervised learning settings have been proposed. Among them, some consider using pairwise but not pointwise labels, when pointwise labels are not accessible due to privacy, confidentiality, or security reasons. However, as a pairwise label denotes whether or not two data points share a pointwise label, it cannot be easily collected if either point is equally likely to be positive or negative. Thus, in this paper, we propose a novel setting called pairwise comparison (Pcomp) classification, where we have only pairs of unlabeled data that we know one is more likely to be positive than the other. Firstly, we give a Pcomp data generation process, derive an unbiased risk estimator (URE) with theoretical guarantee, and further improve URE using correction functions. Secondly, we link Pcomp classification to noisy-label learning to develop a progressive URE and improve it by imposing consistency regularization. Finally, we demonstrate by experiments the effectiveness of our methods, which suggests Pcomp is a valuable and practically useful type of pairwise supervision besides the pairwise label.
    An improved LogNNet classifier for IoT application. (arXiv:2105.14412v2 [cs.LG] UPDATED)
    (2 min) In the age of neural networks and Internet of Things (IoT), the search for new neural network architectures capable of operating on devices with limited computing power and small memory size is becoming an urgent agenda. Designing suitable algorithms for IoT applications is an important task. The paper proposes a feed forward LogNNet neural network, which uses a semi-linear Henon type discrete chaotic map to classify MNIST-10 dataset. The model is composed of reservoir part and trainable classifier. The aim of the reservoir part is transforming the inputs to maximize the classification accuracy using a special matrix filing method and a time series generated by the chaotic map. The parameters of the chaotic map are optimized using particle swarm optimization with random immigrants. As a result, the proposed LogNNet/Henon classifier has higher accuracy and the same RAM usage, compared to the original version of LogNNet, and offers promising opportunities for implementation in IoT devices. In addition, a direct relation between the value of entropy and accuracy of the classification is demonstrated.
    Global Optimality Beyond Two Layers: Training Deep ReLU Networks via Convex Programs. (arXiv:2110.05518v2 [cs.LG] UPDATED)
    (2 min) Understanding the fundamental mechanism behind the success of deep neural networks is one of the key challenges in the modern machine learning literature. Despite numerous attempts, a solid theoretical analysis is yet to be developed. In this paper, we develop a novel unified framework to reveal a hidden regularization mechanism through the lens of convex optimization. We first show that the training of multiple three-layer ReLU sub-networks with weight decay regularization can be equivalently cast as a convex optimization problem in a higher dimensional space, where sparsity is enforced via a group $\ell_1$-norm regularization. Consequently, ReLU networks can be interpreted as high dimensional feature selection methods. More importantly, we then prove that the equivalent convex problem can be globally optimized by a standard convex optimization solver with a polynomial-time complexity with respect to the number of samples and data dimension when the width of the network is fixed. Finally, we numerically validate our theoretical results via experiments involving both synthetic and real datasets.
    Unlabeled Data Improves Adversarial Robustness. (arXiv:1905.13736v4 [stat.ML] UPDATED)
    (2 min) We demonstrate, theoretically and empirically, that adversarial robustness can significantly benefit from semisupervised learning. Theoretically, we revisit the simple Gaussian model of Schmidt et al. that shows a sample complexity gap between standard and robust classification. We prove that unlabeled data bridges this gap: a simple semisupervised learning procedure (self-training) achieves high robust accuracy using the same number of labels required for achieving high standard accuracy. Empirically, we augment CIFAR-10 with 500K unlabeled images sourced from 80 Million Tiny Images and use robust self-training to outperform state-of-the-art robust accuracies by over 5 points in (i) $\ell_\infty$ robustness against several strong attacks via adversarial training and (ii) certified $\ell_2$ and $\ell_\infty$ robustness via randomized smoothing. On SVHN, adding the dataset's own extra training set with the labels removed provides gains of 4 to 10 points, within 1 point of the gain from using the extra labels.
    Ensemble Augmentation for Deep Neural Networks Using 1-D Time Series Vibration Data. (arXiv:2108.03288v2 [cs.LG] UPDATED)
    (2 min) Time-series data are one of the fundamental types of raw data representation used in data-driven techniques. In machine condition monitoring, time-series vibration data are overly used in data mining for deep neural networks. Typically, vibration data is converted into images for classification using Deep Neural Networks (DNNs), and scalograms are the most effective form of image representation. However, the DNN classifiers require huge labeled training samples to reach their optimum performance. So, many forms of data augmentation techniques are applied to the classifiers to compensate for the lack of training samples. However, the scalograms are graphical representations where the existing augmentation techniques suffer because they either change the graphical meaning or have too much noise in the samples that change the physical meaning. In this study, a data augmentation technique named ensemble augmentation is proposed to overcome this limitation. This augmentation method uses the power of white noise added in ensembles to the original samples to generate real-like samples. After averaging the signal with ensembles, a new signal is obtained that contains the characteristics of the original signal. The parameters for the ensemble augmentation are validated using a simulated signal. The proposed method is evaluated using 10 class bearing vibration data using three state-of-the-art Transfer Learning (TL) models, namely, Inception-V3, MobileNet-V2, and ResNet50. Augmented samples are generated in two increments: the first increment generates the same number of fake samples as the training samples, and in the second increment, the number of samples is increased gradually. The outputs from the proposed method are compared with no augmentation, augmentations using deep convolution generative adversarial network (DCGAN), and several geometric transformation-based augmentations...
    Largest Eigenvalues of the Conjugate Kernel of Single-Layered Neural Networks. (arXiv:2201.04753v1 [math.PR])
    (2 min) This paper is concerned with the asymptotic distribution of the largest eigenvalues for some nonlinear random matrix ensemble stemming from the study of neural networks. More precisely we consider $M= \frac{1}{m} YY^\top$ with $Y=f(WX)$ where $W$ and $X$ are random rectangular matrices with i.i.d. centered entries. This models the data covariance matrix or the Conjugate Kernel of a single layered random Feed-Forward Neural Network. The function $f$ is applied entrywise and can be seen as the activation function of the neural network. We show that the largest eigenvalue has the same limit (in probability) as that of some well-known linear random matrix ensembles. In particular, we relate the asymptotic limit of the largest eigenvalue for the nonlinear model to that of an information-plus-noise random matrix, establishing a possible phase transition depending on the function $f$ and the distribution of $W$ and $X$. This may be of interest for applications to machine learning.
    Privacy-Utility Trades in Crowdsourced Signal Map Obfuscation. (arXiv:2201.04782v1 [cs.CR])
    (2 min) Cellular providers and data aggregating companies crowdsource celluar signal strength measurements from user devices to generate signal maps, which can be used to improve network performance. Recognizing that this data collection may be at odds with growing awareness of privacy concerns, we consider obfuscating such data before the data leaves the mobile device. The goal is to increase privacy such that it is difficult to recover sensitive features from the obfuscated data (e.g. user ids and user whereabouts), while still allowing network providers to use the data for improving network services (i.e. create accurate signal maps). To examine this privacy-utility tradeoff, we identify privacy and utility metrics and threat models suited to signal strength measurements. We then obfuscate the measurements using several preeminent techniques, spanning differential privacy, generative adversarial privacy, and information-theoretic privacy techniques, in order to benchmark a variety of promising obfuscation approaches and provide guidance to real-world engineers who are tasked to build signal maps that protect privacy without hurting utility. Our evaluation results, based on multiple, diverse, real-world signal map datasets, demonstrate the feasibility of concurrently achieving adequate privacy and utility, with obfuscation strategies which use the structure and intended use of datasets in their design, and target average-case, rather than worst-case, guarantees.
    1-Dimensional polynomial neural networks for audio signal related problems. (arXiv:2009.04077v2 [eess.AS] UPDATED)
    (2 min) In addition to being extremely non-linear, modern problems require millions if not billions of parameters to solve or at least to get a good approximation of the solution, and neural networks are known to assimilate that complexity by deepening and widening their topology in order to increase the level of non-linearity needed for a better approximation. However, compact topologies are always preferred to deeper ones as they offer the advantage of using less computational units and less parameters. This compacity comes at the price of reduced non-linearity and thus, of limited solution search space. We propose the 1-Dimensional Polynomial Neural Network (1DPNN) model that uses automatic polynomial kernel estimation for 1-Dimensional Convolutional Neural Networks (1DCNNs) and that introduces a high degree of non-linearity from the first layer which can compensate the need for deep and/or wide topologies. We show that this non-linearity enables the model to yield better results with less computational and spatial complexity than a regular 1DCNN on various classification and regression problems related to audio signals, even though it introduces more computational and spatial complexity on a neuronal level. The experiments were conducted on three publicly available datasets and demonstrate that, on the problems that were tackled, the proposed model can extract more relevant information from the data than a 1DCNN in less time and with less memory.
    Interpretable Automated Diagnosis of Retinal Disease Using Deep OCT Analysis. (arXiv:2109.02436v2 [eess.IV] UPDATED)
    (2 min) 30 million Optical Coherence Tomography (OCT) imaging tests are issued every year to diagnose various retinal diseases, but accurate diagnosis of OCT scans requires trained ophthalmologists who are still prone to making errors. With better systems for diagnosis, many cases of vision loss caused by retinal disease could be entirely avoided. In this work, we develop a novel deep learning architecture for explainable, accurate classification of retinal disease which achieves state-of-the-art accuracy. Furthermore, we place an emphasis on producing both qualitative and quantitative explanations of the model's decisions. Our algorithm produces heatmaps indicating the exact regions in the OCT scan the model focused on when making its decision. In combination with an OCT segmentation model, this allows us to produce quantitative breakdowns of the specific retinal layers the model focused on for later review by an expert. Our work is the first to produce detailed quantitative explanations of the model's decisions in this way. Our combination of accuracy and interpretability can be clinically applied for better patient care.
    Towards a trustworthy, secure and reliable enclave for machine learning in a hospital setting: The Essen Medical Computing Platform (EMCP). (arXiv:2201.04816v1 [cs.CR])
    (2 min) AI/Computing at scale is a difficult problem, especially in a health care setting. We outline the requirements, planning and implementation choices as well as the guiding principles that led to the implementation of our secure research computing enclave, the Essen Medical Computing Platform (EMCP), affiliated with a major German hospital. Compliance, data privacy and usability were the immutable requirements of the system. We will discuss the features of our computing enclave and we will provide our recipe for groups wishing to adopt a similar setup.
    Emojich -- zero-shot emoji generation using Russian language: a technical report. (arXiv:2112.02448v2 [cs.CL] UPDATED)
    (2 min) This technical report presents a text-to-image neural network "Emojich" that generates emojis using captions in Russian language as a condition. We aim to keep the generalization ability of a pretrained big model ruDALL-E Malevich (XL) 1.3B parameters at the fine-tuning stage, while giving special style to the images generated. Here are presented some engineering methods, code realization, all hyper-parameters for reproducing results and a Telegram bot where everyone can create their own customized sets of stickers. Also, some newly generated emojis obtained by "Emojich" model are demonstrated.
    Improving VAE based molecular representations for compound property prediction. (arXiv:2201.04929v1 [cs.LG])
    (2 min) Collecting labeled data for many important tasks in chemoinformatics is time consuming and requires expensive experiments. In recent years, machine learning has been used to learn rich representations of molecules using large scale unlabeled molecular datasets and transfer the knowledge to solve the more challenging tasks with limited datasets. Variational autoencoders are one of the tools that have been proposed to perform the transfer for both chemical property prediction and molecular generation tasks. In this work we propose a simple method to improve chemical property prediction performance of machine learning models by incorporating additional information on correlated molecular descriptors in the representations learned by variational autoencoders. We verify the method on three property prediction asks. We explore the impact of the number of incorporated descriptors, correlation between the descriptors and the target properties, sizes of the datasets etc. Finally, we show the relation between the performance of property prediction models and the distance between property prediction dataset and the larger unlabeled dataset in the representation space.
    Online 3D Bin Packing with Constrained Deep Reinforcement Learning. (arXiv:2006.14978v5 [cs.LG] UPDATED)
    (2 min) We solve a challenging yet practically useful variant of 3D Bin Packing Problem (3D-BPP). In our problem, the agent has limited information about the items to be packed into the bin, and an item must be packed immediately after its arrival without buffering or readjusting. The item's placement also subjects to the constraints of collision avoidance and physical stability. We formulate this online 3D-BPP as a constrained Markov decision process. To solve the problem, we propose an effective and easy-to-implement constrained deep reinforcement learning (DRL) method under the actor-critic framework. In particular, we introduce a feasibility predictor to predict the feasibility mask for the placement actions and use it to modulate the action probabilities output by the actor during training. Such supervisions and transformations to DRL facilitate the agent to learn feasible policies efficiently. Our method can also be generalized e.g., with the ability to handle lookahead or items with different orientations. We have conducted extensive evaluation showing that the learned policy significantly outperforms the state-of-the-art methods. A user study suggests that our method attains a human-level performance.
    Generative time series models using Neural ODE in Variational Autoencoders. (arXiv:2201.04630v1 [cs.LG])
    (2 min) In this paper, we implement Neural Ordinary Differential Equations in a Variational Autoencoder setting for generative time series modeling. An object-oriented approach to the code was taken to allow for easier development and research and all code used in the paper can be found here: https://github.com/simonmoesorensen/neural-ode-project The results were initially recreated and the reconstructions compared to a baseline Long-Short Term Memory AutoEncoder. The model was then extended with a LSTM encoder and challenged by more complex data consisting of time series in the form of spring oscillations. The model showed promise, and was able to reconstruct true trajectories for all complexities of data with a smaller RMSE than the baseline model. However, it was able to capture the dynamic behavior of the time series for known data in the decoder but was not able to produce extrapolations following the true trajectory very well for any of the complexities of spring data. A final experiment was carried out where the model was also presented with 68 days of solar power production data, and was able to reconstruct just as well as the baseline, even when very little data is available. Finally, the models training time was compared to the baseline. It was found that for small amounts of data the NODE method was significantly slower at training than the baseline, while for larger amounts of data the NODE method would be equal or faster at training. The paper is ended with a future work section which describes the many natural extensions to the work presented in this paper, with examples being investigating further the importance of input data, including extrapolation in the baseline model or testing more specific model setups.
    Can we imitate the principal investor's behavior to learn option price?. (arXiv:2105.11376v2 [q-fin.PR] UPDATED)
    (2 min) This paper presents a framework of imitating the principal investor's behavior for optimal pricing and hedging options. We construct a non-deterministic Markov decision process for modeling stock price change driven by the principal investor's decision making. However, low signal-to-noise ratio and instability that are inherent in equity markets pose challenges to determine the state transition (stock price change) after executing an action (the principal investor's decision) as well as decide an action based on current state (spot price). In order to conquer these challenges, we resort to a Bayesian deep neural network for computing the predictive distribution of the state transition led by an action. Additionally, instead of exploring a state-action relationship to formulate a policy, we seek for an episode based visible-hidden state-action relationship to probabilistically imitate the principal investor's successive decision making. Unlike conventional option pricing that employs analytical stochastic processes or utilizes time series analysis to model and sample underlying stock price movements, our algorithm simulates stock price paths by imitating the principal investor's behavior which requires no preset probability distribution and fewer predetermined parameters. Eventually the optimal option price is learned by reinforcement learning to maximize the cumulative risk-adjusted return of a dynamically hedged portfolio over simulated price paths.
    Detection of brain tumors using machine learning algorithms. (arXiv:2201.04703v1 [eess.IV])
    (2 min) An algorithm capable of processing NMR images was developed for analysis using machine learning techniques to detect the presence of brain tumors.
    A Non-Classical Parameterization for Density Estimation Using Sample Moments. (arXiv:2201.04786v1 [stat.ML])
    (2 min) Moment methods are an important means of density estimation, but they are generally strongly dependent on the choice of feasible functions, which severely affects the performance. We propose a non-classical parameterization for density estimation using the sample moments, which does not require the choice of such functions. The parameterization is induced by the Kullback-Leibler distance, and the solution of it, which is proved to exist and be unique subject to simple prior that does not depend on data, can be obtained by convex optimization. Simulation results show the performance of the proposed estimator in estimating multi-modal densities which are mixtures of different types of functions.
    Applying Machine Learning and AI Explanations to Analyze Vaccine Hesitancy. (arXiv:2201.05070v1 [cs.LG])
    (2 min) The paper quantifies the impact of race, poverty, politics, and age on COVID-19 vaccination rates in counties in the continental US. Both, OLS regression analysis and Random Forest machine learning algorithms are applied to quantify factors for county-level vaccination hesitancy. The machine learning model considers joint effects of variables (race/ethnicity, partisanship, age, etc.) simultaneously to capture the unique combination of these factors on the vaccination rate. By implementing a state-of-the-art Artificial Intelligence Explanations (AIX) algorithm, it is possible to solve the black box problem with machine learning models and provide answers to the "how much" question for each measured impact factor in every county. For most counties, a higher percentage vote for Republicans, a greater African American population share, and a higher poverty rate lower the vaccination rate. While a higher Asian population share increases the predicted vaccination rate. The impact on the vaccination rate from the Hispanic population proportion is positive in the OLS model, but only positive for counties with a high Hispanic population (>65%) in the Random Forest model. Both the proportion of seniors and the one for young people in a county have a significant impact in the OLS model - positive and negative, respectively. In contrast, the impacts are ambiguous in the Random Forest model. Because results vary between geographies and since the AIX algorithm is able to quantify vaccine impacts individually for each county, this research can be tailored to local communities. An interactive online mapping dashboard that identifies impact factors for individual U.S. counties is available at https://www.cpp.edu/~clange/vacmap.html. It is apparent that the influence of impact factors is not universally the same across different geographies.
    Towards Automated Error Analysis: Learning to Characterize Errors. (arXiv:2201.05017v1 [cs.CL])
    (2 min) Characterizing the patterns of errors that a system makes helps researchers focus future development on increasing its accuracy and robustness. We propose a novel form of "meta learning" that automatically learns interpretable rules that characterize the types of errors that a system makes, and demonstrate these rules' ability to help understand and improve two NLP systems. Our approach works by collecting error cases on validation data, extracting meta-features describing these samples, and finally learning rules that characterize errors using these features. We apply our approach to VilBERT, for Visual Question Answering, and RoBERTa, for Common Sense Question Answering. Our system learns interpretable rules that provide insights into systemic errors these systems make on the given tasks. Using these insights, we are also able to "close the loop" and modestly improve performance of these systems.
    Active Learning-Based Multistage Sequential Decision-Making Model with Application on Common Bile Duct Stone Evaluation. (arXiv:2201.04807v1 [stat.ML])
    (2 min) Multistage sequential decision-making scenarios are commonly seen in the healthcare diagnosis process. In this paper, an active learning-based method is developed to actively collect only the necessary patient data in a sequential manner. There are two novelties in the proposed method. First, unlike the existing ordinal logistic regression model which only models a single stage, we estimate the parameters for all stages together. Second, it is assumed that the coefficients for common features in different stages are kept consistent. The effectiveness of the proposed method is validated in both a simulation study and a real case study. Compared with the baseline method where the data is modeled individually and independently, the proposed method improves the estimation efficiency by 62\%-1838\%. For both simulation and testing cohorts, the proposed method is more effective, stable, interpretable, and computationally efficient on parameter estimation. The proposed method can be easily extended to a variety of scenarios where decision-making can be done sequentially with only necessary information.
    Machine Learning-enhanced Efficient Spectroscopic Ellipsometry Modeling. (arXiv:2201.04933v1 [cond-mat.mtrl-sci])
    (2 min) Over the recent years, there has been an extensive adoption of Machine Learning (ML) in a plethora of real-world applications, ranging from computer vision to data mining and drug discovery. In this paper, we utilize ML to facilitate efficient film fabrication, specifically Atomic Layer Deposition (ALD). In order to make advances in ALD process development, which is utilized to generate thin films, and its subsequent accelerated adoption in industry, it is imperative to understand the underlying atomistic processes. Towards this end, in situ techniques for monitoring film growth, such as Spectroscopic Ellipsometry (SE), have been proposed. However, in situ SE is associated with complex hardware and, hence, is resource intensive. To address these challenges, we propose an ML-based approach to expedite film thickness estimation. The proposed approach has tremendous implications of faster data acquisition, reduced hardware complexity and easier integration of spectroscopic ellipsometry for in situ monitoring of film thickness deposition. Our experimental results involving SE of TiO2 demonstrate that the proposed ML-based approach furnishes promising thickness prediction accuracy results of 88.76% within +/-1.5 nm and 85.14% within +/-0.5 nm intervals. Furthermore, we furnish accuracy results up to 98% at lower thicknesses, which is a significant improvement over existing SE-based analysis, thereby making our solution a viable option for thickness estimation of ultrathin films.
    Functional Anomaly Detection: a Benchmark Study. (arXiv:2201.05115v1 [stat.ML])
    (2 min) The increasing automation in many areas of the Industry expressly demands to design efficient machine-learning solutions for the detection of abnormal events. With the ubiquitous deployment of sensors monitoring nearly continuously the health of complex infrastructures, anomaly detection can now rely on measurements sampled at a very high frequency, providing a very rich representation of the phenomenon under surveillance. In order to exploit fully the information thus collected, the observations cannot be treated as multivariate data anymore and a functional analysis approach is required. It is the purpose of this paper to investigate the performance of recent techniques for anomaly detection in the functional setup on real datasets. After an overview of the state-of-the-art and a visual-descriptive study, a variety of anomaly detection methods are compared. While taxonomies of abnormalities (e.g. shape, location) in the functional setup are documented in the literature, assigning a specific type to the identified anomalies appears to be a challenging task. Thus, strengths and weaknesses of the existing approaches are benchmarked in view of these highlighted types in a simulation study. Anomaly detection methods are next evaluated on two datasets, related to the monitoring of helicopters in flight and to the spectrometry of construction materials namely. The benchmark analysis is concluded by recommendation guidance for practitioners.
    Forecast-based Multi-aspect Framework for Multivariate Time-series Anomaly Detection. (arXiv:2201.04792v1 [cs.LG])
    (2 min) Today's cyber-world is vastly multivariate. Metrics collected at extreme varieties demand multivariate algorithms to properly detect anomalies. However, forecast-based algorithms, as widely proven approaches, often perform sub-optimally or inconsistently across datasets. A key common issue is they strive to be one-size-fits-all but anomalies are distinctive in nature. We propose a method that tailors to such distinction. Presenting FMUAD - a Forecast-based, Multi-aspect, Unsupervised Anomaly Detection framework. FMUAD explicitly and separately captures the signature traits of anomaly types - spatial change, temporal change and correlation change - with independent modules. The modules then jointly learn an optimal feature representation, which is highly flexible and intuitive, unlike most other models in the category. Extensive experiments show our FMUAD framework consistently outperforms other state-of-the-art forecast-based anomaly detectors.
    On neural network kernels and the storage capacity problem. (arXiv:2201.04669v1 [cond-mat.dis-nn])
    (2 min) In this short note, we reify the connection between work on the storage capacity problem in wide two-layer treelike neural networks and the rapidly-growing body of literature on kernel limits of wide neural networks. Concretely, we observe that the "effective order parameter" studied in the statistical mechanics literature is exactly equivalent to the infinite-width Neural Network Gaussian Process Kernel. This correspondence connects the expressivity and trainability of wide two-layer neural networks.
    Reconstructing Training Data with Informed Adversaries. (arXiv:2201.04845v1 [cs.CR])
    (2 min) Given access to a machine learning model, can an adversary reconstruct the model's training data? This work studies this question from the lens of a powerful informed adversary who knows all the training data points except one. By instantiating concrete attacks, we show it is feasible to reconstruct the remaining data point in this stringent threat model. For convex models (e.g. logistic regression), reconstruction attacks are simple and can be derived in closed-form. For more general models (e.g. neural networks), we propose an attack strategy based on training a reconstructor network that receives as input the weights of the model under attack and produces as output the target data point. We demonstrate the effectiveness of our attack on image classifiers trained on MNIST and CIFAR-10, and systematically investigate which factors of standard machine learning pipelines affect reconstruction success. Finally, we theoretically investigate what amount of differential privacy suffices to mitigate reconstruction attacks by informed adversaries. Our work provides an effective reconstruction attack that model developers can use to assess memorization of individual points in general settings beyond those considered in previous works (e.g. generative language models or access to training gradients); it shows that standard models have the capacity to store enough information to enable high-fidelity reconstruction of training data points; and it demonstrates that differential privacy can successfully mitigate such attacks in a parameter regime where utility degradation is minimal.
    Conditional Variational Autoencoder with Balanced Pre-training for Generative Adversarial Networks. (arXiv:2201.04809v1 [cs.CV])
    (2 min) Class imbalance occurs in many real-world applications, including image classification, where the number of images in each class differs significantly. With imbalanced data, the generative adversarial networks (GANs) leans to majority class samples. The two recent methods, Balancing GAN (BAGAN) and improved BAGAN (BAGAN-GP), are proposed as an augmentation tool to handle this problem and restore the balance to the data. The former pre-trains the autoencoder weights in an unsupervised manner. However, it is unstable when the images from different categories have similar features. The latter is improved based on BAGAN by facilitating supervised autoencoder training, but the pre-training is biased towards the majority classes. In this work, we propose a novel Conditional Variational Autoencoder with Balanced Pre-training for Generative Adversarial Networks (CAPGAN) as an augmentation tool to generate realistic synthetic images. In particular, we utilize a conditional convolutional variational autoencoder with supervised and balanced pre-training for the GAN initialization and training with gradient penalty. Our proposed method presents a superior performance of other state-of-the-art methods on the highly imbalanced version of MNIST, Fashion-MNIST, CIFAR-10, and two medical imaging datasets. Our method can synthesize high-quality minority samples in terms of Fr\'echet inception distance, structural similarity index measure and perceptual quality.
    Real-Time GPU-Accelerated Machine Learning Based Multiuser Detection for 5G and Beyond. (arXiv:2201.05024v1 [eess.SP])
    (2 min) Adaptive partial linear beamforming meets the need of 5G and future 6G applications for high flexibility and adaptability. Choosing an appropriate tradeoff between conflicting goals opens the recently proposed multiuser (MU) detection method. Due to their high spatial resolution, nonlinear beamforming filters can significantly outperform linear approaches in stationary scenarios with massive connectivity. However, a dramatic decrease in performance can be expected in high mobility scenarios because they are very susceptible to changes in the wireless channel. The robustness of linear filters is required, considering these changes. One way to respond appropriately is to use online machine learning algorithms. The theory of algorithms based on the adaptive projected subgradient method (APSM) is rich, and they promise accurate tracking capabilities in dynamic wireless environments. However, one of the main challenges comes from the real-time implementation of these algorithms, which involve projections on time-varying closed convex sets. While the projection operations are relatively simple, their vast number poses a challenge in ultralow latency (ULL) applications where latency constraints must be satisfied in every radio frame. Taking non-orthogonal multiple access (NOMA) systems as an example, this paper explores the acceleration of APSM-based algorithms through massive parallelization. The result is a GPU-accelerated real-time implementation of an orthogonal frequency-division multiplexing (OFDM)-based transceiver that enables detection latency of less than one millisecond and therefore complies with the requirements of 5G and beyond. To meet the stringent physical layer latency requirements, careful co-design of hardware and software is essential, especially in virtualized wireless systems with hardware accelerators.
    Recursive Least Squares for Training and Pruning Convolutional Neural Networks. (arXiv:2201.04813v1 [cs.LG])
    (2 min) Convolutional neural networks (CNNs) have succeeded in many practical applications. However, their high computation and storage requirements often make them difficult to deploy on resource-constrained devices. In order to tackle this issue, many pruning algorithms have been proposed for CNNs, but most of them can't prune CNNs to a reasonable level. In this paper, we propose a novel algorithm for training and pruning CNNs based on the recursive least squares (RLS) optimization. After training a CNN for some epochs, our algorithm combines inverse input autocorrelation matrices and weight matrices to evaluate and prune unimportant input channels or nodes layer by layer. Then, our algorithm will continue to train the pruned network, and won't do the next pruning until the pruned network recovers the full performance of the old network. Besides for CNNs, the proposed algorithm can be used for feedforward neural networks (FNNs). Three experiments on MNIST, CIFAR-10 and SVHN datasets show that our algorithm can achieve the more reasonable pruning and have higher learning efficiency than other four popular pruning algorithms.
  • stat.ML updates on arXiv.org

    Multi-task longitudinal forecasting with missing values on Alzheimer's Disease. (arXiv:2201.05040v1 [stat.ML])
    (2 min) Machine learning techniques typically applied to dementia forecasting lack in their capabilities to jointly learn several tasks, handle time dependent heterogeneous data and missing values. In this paper, we propose a framework using the recently presented SSHIBA model for jointly learning different tasks on longitudinal data with missing values. The method uses Bayesian variational inference to impute missing values and combine information of several views. This way, we can combine different data-views from different time-points in a common latent space and learn the relations between each time-point while simultaneously modelling and predicting several output variables. We apply this model to predict together diagnosis, ventricle volume, and clinical scores in dementia. The results demonstrate that SSHIBA is capable of learning a good imputation of the missing values and outperforming the baselines while simultaneously predicting three different tasks.
    Functional Anomaly Detection: a Benchmark Study. (arXiv:2201.05115v1 [stat.ML])
    (0 min) The increasing automation in many areas of the Industry expressly demands to design efficient machine-learning solutions for the detection of abnormal events. With the ubiquitous deployment of sensors monitoring nearly continuously the health of complex infrastructures, anomaly detection can now rely on measurements sampled at a very high frequency, providing a very rich representation of the phenomenon under surveillance. In order to exploit fully the information thus collected, the observations cannot be treated as multivariate data anymore and a functional analysis approach is required. It is the purpose of this paper to investigate the performance of recent techniques for anomaly detection in the functional setup on real datasets. After an overview of the state-of-the-art and a visual-descriptive study, a variety of anomaly detection methods are compared. While taxonomies of abnormalities (e.g. shape, location) in the functional setup are documented in the literature, assigning a specific type to the identified anomalies appears to be a challenging task. Thus, strengths and weaknesses of the existing approaches are benchmarked in view of these highlighted types in a simulation study. Anomaly detection methods are next evaluated on two datasets, related to the monitoring of helicopters in flight and to the spectrometry of construction materials namely. The benchmark analysis is concluded by recommendation guidance for practitioners.
    Partial Recovery in the Graph Alignment Problem. (arXiv:2007.00533v5 [stat.ML] UPDATED)
    (0 min) In this paper, we consider the graph alignment problem, which is the problem of recovering, given two graphs, a one-to-one mapping between nodes that maximizes edge overlap. This problem can be viewed as a noisy version of the well-known graph isomorphism problem and appears in many applications, including social network deanonymization and cellular biology. Our focus here is on partial recovery, i.e., we look for a one-to-one mapping which is correct on a fraction of the nodes of the graph rather than on all of them, and we assume that the two input graphs to the problem are correlated Erd\H{o}s-R\'enyi graphs of parameters $(n,q,s)$. Our main contribution is then to give necessary and sufficient conditions on $(n,q,s)$ under which partial recovery is possible with high probability as the number of nodes $n$ goes to infinity. In particular, we show that it is possible to achieve partial recovery in the $nqs=\Theta(1)$ regime under certain additional assumptions.
    DS-Sync: Addressing Network Bottlenecks with Divide-and-Shuffle Synchronization for Distributed DNN Training. (arXiv:2007.03298v2 [cs.LG] UPDATED)
    (0 min) Bulk synchronous parallel (BSP) is the de-facto paradigm for distributed DNN training in today's production clusters. However, due to the global synchronization nature, its performance can be significantly influenced by network bottlenecks caused by either static topology heterogeneity or dynamic bandwidth contentions. Existing solutions, either system-level optimizations strengthening BSP (e.g., Ring or Hierarchical All-reduce) or algorithmic optimizations replacing BSP (e.g., ASP or SSP, which relax the global barriers), do not completely solve the problem, as they may still suffer from communication inefficiency or risk convergence inaccuracy. In this paper, we present a novel divide-and-shuffle synchronization (DS-Sync) to realize communication efficiency without sacrificing convergence accuracy for distributed DNN training. At its heart, by taking into account the network bottlenecks, DS-Sync improves communication efficiency by dividing workers into non-overlap groups to synchronize independently in a bottleneck-free manner. Meanwhile, it maintains convergence accuracy by iteratively shuffling workers among different groups to ensure a global consensus. We theoretically prove that DS-Sync converges properly in non-convex and smooth conditions like DNN. We further implement DS-Sync and integrate it with PyTorch, and our testbed experiments show that DS-Sync can achieve up to $94\%$ improvements on the end-to-end training time with existing solutions while maintaining the same accuracy.
    On component interactions in two-stage recommender systems. (arXiv:2106.14979v3 [cs.IR] UPDATED)
    (0 min) Thanks to their scalability, two-stage recommenders are used by many of today's largest online platforms, including YouTube, LinkedIn, and Pinterest. These systems produce recommendations in two steps: (i) multiple nominators, tuned for low prediction latency, preselect a small subset of candidates from the whole item pool; (ii) a slower but more accurate ranker further narrows down the nominated items, and serves to the user. Despite their popularity, the literature on two-stage recommenders is relatively scarce, and the algorithms are often treated as mere sums of their parts. Such treatment presupposes that the two-stage performance is explained by the behavior of the individual components in isolation. This is not the case: using synthetic and real-world data, we demonstrate that interactions between the ranker and the nominators substantially affect the overall performance. Motivated by these findings, we derive a generalization lower bound which shows that independent nominator training can lead to performance on par with uniformly random recommendations. We find that careful design of item pools, each assigned to a different nominator, alleviates these issues. As manual search for a good pool allocation is difficult, we propose to learn one instead using a Mixture-of-Experts based approach. This significantly improves both precision and recall at K.
    Hyperparameter Importance for Machine Learning Algorithms. (arXiv:2201.05132v1 [stat.ML])
    (0 min) Hyperparameter plays an essential role in the fitting of supervised machine learning algorithms. However, it is computationally expensive to tune all the tunable hyperparameters simultaneously especially for large data sets. In this paper, we give a definition of hyperparameter importance that can be estimated by subsampling procedures. According to the importance, hyperparameters can then be tuned on the entire data set more efficiently. We show theoretically that the proposed importance on subsets of data is consistent with the one on the population data under weak conditions. Numerical experiments show that the proposed importance is consistent and can save a lot of computational resources.
    Online 3D Bin Packing with Constrained Deep Reinforcement Learning. (arXiv:2006.14978v5 [cs.LG] UPDATED)
    (0 min) We solve a challenging yet practically useful variant of 3D Bin Packing Problem (3D-BPP). In our problem, the agent has limited information about the items to be packed into the bin, and an item must be packed immediately after its arrival without buffering or readjusting. The item's placement also subjects to the constraints of collision avoidance and physical stability. We formulate this online 3D-BPP as a constrained Markov decision process. To solve the problem, we propose an effective and easy-to-implement constrained deep reinforcement learning (DRL) method under the actor-critic framework. In particular, we introduce a feasibility predictor to predict the feasibility mask for the placement actions and use it to modulate the action probabilities output by the actor during training. Such supervisions and transformations to DRL facilitate the agent to learn feasible policies efficiently. Our method can also be generalized e.g., with the ability to handle lookahead or items with different orientations. We have conducted extensive evaluation showing that the learned policy significantly outperforms the state-of-the-art methods. A user study suggests that our method attains a human-level performance.
    Planning in Observable POMDPs in Quasipolynomial Time. (arXiv:2201.04735v1 [cs.LG])
    (0 min) Partially Observable Markov Decision Processes (POMDPs) are a natural and general model in reinforcement learning that take into account the agent's uncertainty about its current state. In the literature on POMDPs, it is customary to assume access to a planning oracle that computes an optimal policy when the parameters are known, even though the problem is known to be computationally hard. Almost all existing planning algorithms either run in exponential time, lack provable performance guarantees, or require placing strong assumptions on the transition dynamics under every possible policy. In this work, we revisit the planning problem and ask: are there natural and well-motivated assumptions that make planning easy? Our main result is a quasipolynomial-time algorithm for planning in (one-step) observable POMDPs. Specifically, we assume that well-separated distributions on states lead to well-separated distributions on observations, and thus the observations are at least somewhat informative in each step. Crucially, this assumption places no restrictions on the transition dynamics of the POMDP; nevertheless, it implies that near-optimal policies admit quasi-succinct descriptions, which is not true in general (under standard hardness assumptions). Our analysis is based on new quantitative bounds for filter stability -- i.e. the rate at which an optimal filter for the latent state forgets its initialization. Furthermore, we prove matching hardness for planning in observable POMDPs under the Exponential Time Hypothesis.
    Continuous Herded Gibbs Sampling. (arXiv:2106.06430v2 [stat.ML] UPDATED)
    (0 min) Herding is a technique to sequentially generate deterministic samples from a probability distribution. In this work, we propose a continuous herded Gibbs sampler that combines kernel herding on continuous densities with the Gibbs sampling idea. Our algorithm allows for deterministically sampling from high-dimensional multivariate probability densities, without directly sampling from the joint density. Experiments with Gaussian mixture densities indicate that the L2 error decreases similarly to kernel herding, while the computation time is significantly lower, i.e., linear in the number of dimensions.
    The curse of overparametrization in adversarial training: Precise analysis of robust generalization for random features regression. (arXiv:2201.05149v1 [cs.LG])
    (0 min) Successful deep learning models often involve training neural network architectures that contain more parameters than the number of training samples. Such overparametrized models have been extensively studied in recent years, and the virtues of overparametrization have been established from both the statistical perspective, via the double-descent phenomenon, and the computational perspective via the structural properties of the optimization landscape. Despite the remarkable success of deep learning architectures in the overparametrized regime, it is also well known that these models are highly vulnerable to small adversarial perturbations in their inputs. Even when adversarially trained, their performance on perturbed inputs (robust generalization) is considerably worse than their best attainable performance on benign inputs (standard generalization). It is thus imperative to understand how overparametrization fundamentally affects robustness. In this paper, we will provide a precise characterization of the role of overparametrization on robustness by focusing on random features regression models (two-layer neural networks with random first layer weights). We consider a regime where the sample size, the input dimension and the number of parameters grow in proportion to each other, and derive an asymptotically exact formula for the robust generalization error when the model is adversarially trained. Our developed theory reveals the nontrivial effect of overparametrization on robustness and indicates that for adversarially trained random features models, high overparametrization can hurt robust generalization.
    A Comparative Study on Basic Elements of Deep Learning Models for Spatial-Temporal Traffic Forecasting. (arXiv:2111.07513v2 [cs.LG] UPDATED)
    (0 min) Traffic forecasting plays a crucial role in intelligent transportation systems. The spatial-temporal complexities in transportation networks make the problem especially challenging. The recently suggested deep learning models share basic elements such as graph convolution, graph attention, recurrent units, and/or attention mechanism. In this study, we designed an in-depth comparative study for four deep neural network models utilizing different basic elements. For base models, one RNN-based model and one attention-based model were chosen from previous literature. Then, the spatial feature extraction layers in the models were substituted with graph convolution and graph attention. To analyze the performance of each element in various environments, we conducted experiments on four real-world datasets - highway speed, highway flow, urban speed from a homogeneous road link network, and urban speed from a heterogeneous road link network. The results demonstrate that the RNN-based model and the attention-based model show a similar level of performance for short-term prediction, and the attention-based model outperforms the RNN in longer-term predictions. The choice of graph convolution and graph attention makes a larger difference in the RNN-based models. Also, our modified version of GMAN shows comparable performance with the original with less memory consumption.
    Pointwise Binary Classification with Pairwise Confidence Comparisons. (arXiv:2010.01875v4 [cs.LG] UPDATED)
    (0 min) To alleviate the data requirement for training effective binary classifiers in binary classification, many weakly supervised learning settings have been proposed. Among them, some consider using pairwise but not pointwise labels, when pointwise labels are not accessible due to privacy, confidentiality, or security reasons. However, as a pairwise label denotes whether or not two data points share a pointwise label, it cannot be easily collected if either point is equally likely to be positive or negative. Thus, in this paper, we propose a novel setting called pairwise comparison (Pcomp) classification, where we have only pairs of unlabeled data that we know one is more likely to be positive than the other. Firstly, we give a Pcomp data generation process, derive an unbiased risk estimator (URE) with theoretical guarantee, and further improve URE using correction functions. Secondly, we link Pcomp classification to noisy-label learning to develop a progressive URE and improve it by imposing consistency regularization. Finally, we demonstrate by experiments the effectiveness of our methods, which suggests Pcomp is a valuable and practically useful type of pairwise supervision besides the pairwise label.
    A Non-Classical Parameterization for Density Estimation Using Sample Moments. (arXiv:2201.04786v1 [stat.ML])
    (0 min) Moment methods are an important means of density estimation, but they are generally strongly dependent on the choice of feasible functions, which severely affects the performance. We propose a non-classical parameterization for density estimation using the sample moments, which does not require the choice of such functions. The parameterization is induced by the Kullback-Leibler distance, and the solution of it, which is proved to exist and be unique subject to simple prior that does not depend on data, can be obtained by convex optimization. Simulation results show the performance of the proposed estimator in estimating multi-modal densities which are mixtures of different types of functions.
    Hyperparameter Selection for Subsampling Bootstraps. (arXiv:2006.01786v2 [stat.ME] UPDATED)
    (0 min) Massive data analysis becomes increasingly prevalent, subsampling methods like BLB (Bag of Little Bootstraps) serves as powerful tools for assessing the quality of estimators for massive data. However, the performance of the subsampling methods are highly influenced by the selection of tuning parameters ( e.g., the subset size, number of resamples per subset ). In this article we develop a hyperparameter selection methodology, which can be used to select tuning parameters for subsampling methods. Specifically, by a careful theoretical analysis, we find an analytically simple and elegant relationship between the asymptotic efficiency of various subsampling estimators and their hyperparameters. This leads to an optimal choice of the hyperparameters. More specifically, for an arbitrarily specified hyperparameter set, we can improve it to be a new set of hyperparameters with no extra CPU time cost, but the resulting estimator's statistical efficiency can be much improved. Both simulation studies and real data analysis demonstrate the superior advantage of our method.
    Global Optimality Beyond Two Layers: Training Deep ReLU Networks via Convex Programs. (arXiv:2110.05518v2 [cs.LG] UPDATED)
    (0 min) Understanding the fundamental mechanism behind the success of deep neural networks is one of the key challenges in the modern machine learning literature. Despite numerous attempts, a solid theoretical analysis is yet to be developed. In this paper, we develop a novel unified framework to reveal a hidden regularization mechanism through the lens of convex optimization. We first show that the training of multiple three-layer ReLU sub-networks with weight decay regularization can be equivalently cast as a convex optimization problem in a higher dimensional space, where sparsity is enforced via a group $\ell_1$-norm regularization. Consequently, ReLU networks can be interpreted as high dimensional feature selection methods. More importantly, we then prove that the equivalent convex problem can be globally optimized by a standard convex optimization solver with a polynomial-time complexity with respect to the number of samples and data dimension when the width of the network is fixed. Finally, we numerically validate our theoretical results via experiments involving both synthetic and real datasets.
    Unlabeled Data Improves Adversarial Robustness. (arXiv:1905.13736v4 [stat.ML] UPDATED)
    (0 min) We demonstrate, theoretically and empirically, that adversarial robustness can significantly benefit from semisupervised learning. Theoretically, we revisit the simple Gaussian model of Schmidt et al. that shows a sample complexity gap between standard and robust classification. We prove that unlabeled data bridges this gap: a simple semisupervised learning procedure (self-training) achieves high robust accuracy using the same number of labels required for achieving high standard accuracy. Empirically, we augment CIFAR-10 with 500K unlabeled images sourced from 80 Million Tiny Images and use robust self-training to outperform state-of-the-art robust accuracies by over 5 points in (i) $\ell_\infty$ robustness against several strong attacks via adversarial training and (ii) certified $\ell_2$ and $\ell_\infty$ robustness via randomized smoothing. On SVHN, adding the dataset's own extra training set with the labels removed provides gains of 4 to 10 points, within 1 point of the gain from using the extra labels.
    Active Learning-Based Multistage Sequential Decision-Making Model with Application on Common Bile Duct Stone Evaluation. (arXiv:2201.04807v1 [stat.ML])
    (0 min) Multistage sequential decision-making scenarios are commonly seen in the healthcare diagnosis process. In this paper, an active learning-based method is developed to actively collect only the necessary patient data in a sequential manner. There are two novelties in the proposed method. First, unlike the existing ordinal logistic regression model which only models a single stage, we estimate the parameters for all stages together. Second, it is assumed that the coefficients for common features in different stages are kept consistent. The effectiveness of the proposed method is validated in both a simulation study and a real case study. Compared with the baseline method where the data is modeled individually and independently, the proposed method improves the estimation efficiency by 62\%-1838\%. For both simulation and testing cohorts, the proposed method is more effective, stable, interpretable, and computationally efficient on parameter estimation. The proposed method can be easily extended to a variety of scenarios where decision-making can be done sequentially with only necessary information.
    Generalized Kernel Ridge Regression for Long Term Causal Inference: Treatment Effects, Dose Responses, and Counterfactual Distributions. (arXiv:2201.05139v1 [econ.EM])
    (0 min) I propose kernel ridge regression estimators for long term causal inference, where a short term experimental data set containing randomized treatment and short term surrogates is fused with a long term observational data set containing short term surrogates and long term outcomes. I propose estimators of treatment effects, dose responses, and counterfactual distributions with closed form solutions in terms of kernel matrix operations. I allow covariates, treatment, and surrogates to be discrete or continuous, and low, high, or infinite dimensional. For long term treatment effects, I prove $\sqrt{n}$ consistency, Gaussian approximation, and semiparametric efficiency. For long term dose responses, I prove uniform consistency with finite sample rates. For long term counterfactual distributions, I prove convergence in distribution.
    Pushing the limits of self-supervised ResNets: Can we outperform supervised learning without labels on ImageNet?. (arXiv:2201.05119v1 [cs.CV])
    (0 min) Despite recent progress made by self-supervised methods in representation learning with residual networks, they still underperform supervised learning on the ImageNet classification benchmark, limiting their applicability in performance-critical settings. Building on prior theoretical insights from Mitrovic et al., 2021, we propose ReLICv2 which combines an explicit invariance loss with a contrastive objective over a varied set of appropriately constructed data views. ReLICv2 achieves 77.1% top-1 classification accuracy on ImageNet using linear evaluation with a ResNet50 architecture and 80.6% with larger ResNet models, outperforming previous state-of-the-art self-supervised approaches by a wide margin. Most notably, ReLICv2 is the first representation learning method to consistently outperform the supervised baseline in a like-for-like comparison using a range of standard ResNet architectures. Finally we show that despite using ResNet encoders, ReLICv2 is comparable to state-of-the-art self-supervised vision transformers.
    Spatiotemporal Clustering with Neyman-Scott Processes via Connections to Bayesian Nonparametric Mixture Models. (arXiv:2201.05044v1 [stat.ML])
    (0 min) Neyman-Scott process (NSP) are point process models that generate clusters of points in time or space. They are natural models for a wide range of phenomena, ranging from neural spike trains to document streams. The clustering property is achieved via a doubly stochastic formulation: first, a set of latent events is drawn from a Poisson process; then, each latent event generates a set of observed data points according to another Poisson process. This construction is similar to Bayesian nonparametric mixture models like the Dirichlet process mixture model (DPMM) in that the number of latent events (i.e. clusters) is a random variable, but the point process formulation makes the NSP especially well suited to modeling spatiotemporal data. While many specialized algorithms have been developed for DPMMs, comparatively fewer works have focused on inference in NSPs. Here, we present novel connections between NSPs and DPMMs, with the key link being a third class of Bayesian mixture models called mixture of finite mixture models (MFMMs). Leveraging this connection, we adapt the standard collapsed Gibbs sampling algorithm for DPMMs to enable scalable Bayesian inference on NSP models. We demonstrate the potential of Neyman-Scott processes on a variety of applications including sequence detection in neural spike trains and event detection in document streams.
    How Tight Can PAC-Bayes be in the Small Data Regime?. (arXiv:2106.03542v4 [stat.ML] UPDATED)
    (2 min) In this paper, we investigate the question: Given a small number of datapoints, for example N = 30, how tight can PAC-Bayes and test set bounds be made? For such small datasets, test set bounds adversely affect generalisation performance by withholding data from the training procedure. In this setting, PAC-Bayes bounds are especially attractive, due to their ability to use all the data to simultaneously learn a posterior and bound its generalisation risk. We focus on the case of i.i.d. data with a bounded loss and consider the generic PAC-Bayes theorem of Germain et al. While their theorem is known to recover many existing PAC-Bayes bounds, it is unclear what the tightest bound derivable from their framework is. For a fixed learning algorithm and dataset, we show that the tightest possible bound coincides with a bound considered by Catoni; and, in the more natural case of distributions over datasets, we establish a lower bound on the best bound achievable in expectation. Interestingly, this lower bound recovers the Chernoff test set bound if the posterior is equal to the prior. Moreover, to illustrate how tight these bounds can be, we study synthetic one-dimensional classification tasks in which it is feasible to meta-learn both the prior and the form of the bound to numerically optimise for the tightest bounds possible. We find that in this simple, controlled scenario, PAC-Bayes bounds are competitive with comparable, commonly used Chernoff test set bounds. However, the sharpest test set bounds still lead to better guarantees on the generalisation error than the PAC-Bayes bounds we consider.
    Implicit Bias of MSE Gradient Optimization in Underparameterized Neural Networks. (arXiv:2201.04738v1 [stat.ML])
    (2 min) We study the dynamics of a neural network in function space when optimizing the mean squared error via gradient flow. We show that in the underparameterized regime the network learns eigenfunctions of an integral operator $T_{K^\infty}$ determined by the Neural Tangent Kernel (NTK) at rates corresponding to their eigenvalues. For example, for uniformly distributed data on the sphere $S^{d - 1}$ and rotation invariant weight distributions, the eigenfunctions of $T_{K^\infty}$ are the spherical harmonics. Our results can be understood as describing a spectral bias in the underparameterized regime. The proofs use the concept of "Damped Deviations", where deviations of the NTK matter less for eigendirections with large eigenvalues due to the occurence of a damping factor. Aside from the underparameterized regime, the damped deviations point-of-view can be used to track the dynamics of the empirical risk in the overparameterized setting, allowing us to extend certain results in the literature. We conclude that damped deviations offers a simple and unifying perspective of the dynamics when optimizing the squared error.
    A robust kernel machine regression towards biomarker selection in multi-omics datasets of osteoporosis for drug discovery. (arXiv:2201.05060v1 [stat.ML])
    (2 min) Many statistical machine approaches could ultimately highlight novel features of the etiology of complex diseases by analyzing multi-omics data. However, they are sensitive to some deviations in distribution when the observed samples are potentially contaminated with adversarial corrupted outliers (e.g., a fictional data distribution). Likewise, statistical advances lag in supporting comprehensive data-driven analyses of complex multi-omics data integration. We propose a novel non-linear M-estimator-based approach, "robust kernel machine regression (RobKMR)," to improve the robustness of statistical machine regression and the diversity of fictional data to examine the higher-order composite effect of multi-omics datasets. We address a robust kernel-centered Gram matrix to estimate the model parameters accurately. We also propose a robust score test to assess the marginal and joint Hadamard product of features from multi-omics data. We apply our proposed approach to a multi-omics dataset of osteoporosis (OP) from Caucasian females. Experiments demonstrate that the proposed approach effectively identifies the inter-related risk factors of OP. With solid evidence (p-value = 0.00001), biological validations, network-based analysis, causal inference, and drug repurposing, the selected three triplets ((DKK1, SMTN, DRGX), (MTND5, FASTKD2, CSMD3), (MTND5, COG3, CSMD3)) are significant biomarkers and directly relate to BMD. Overall, the top three selected genes (DKK1, MTND5, FASTKD2) and one gene (SIDT1 at p-value= 0.001) significantly bond with four drugs- Tacrolimus, Ibandronate, Alendronate, and Bazedoxifene out of 30 candidates for drug repurposing in OP. Further, the proposed approach can be applied to any disease model where multi-omics datasets are available.
    Unifying Epidemic Models with Mixtures. (arXiv:2201.04960v1 [stat.ML])
    (2 min) The COVID-19 pandemic has emphasized the need for a robust understanding of epidemic models. Current models of epidemics are classified as either mechanistic or non-mechanistic: mechanistic models make explicit assumptions on the dynamics of disease, whereas non-mechanistic models make assumptions on the form of observed time series. Here, we introduce a simple mixture-based model which bridges the two approaches while retaining benefits of both. The model represents time series of cases and fatalities as a mixture of Gaussian curves, providing a flexible function class to learn from data compared to traditional mechanistic models. Although the model is non-mechanistic, we show that it arises as the natural outcome of a stochastic process based on a networked SIR framework. This allows learned parameters to take on a more meaningful interpretation compared to similar non-mechanistic models, and we validate the interpretations using auxiliary mobility data collected during the COVID-19 pandemic. We provide a simple learning algorithm to identify model parameters and establish theoretical results which show the model can be efficiently learned from data. Empirically, we find the model to have low prediction error. The model is available live at covidpredictions.mit.edu. Ultimately, this allows us to systematically understand the impacts of interventions on COVID-19, which is critical in developing data-driven solutions to controlling epidemics.
    Generalized Shape Metrics on Neural Representations. (arXiv:2110.14739v2 [stat.ML] UPDATED)
    (0 min) Understanding the operation of biological and artificial networks remains a difficult and important challenge. To identify general principles, researchers are increasingly interested in surveying large collections of networks that are trained on, or biologically adapted to, similar tasks. A standardized set of analysis tools is now needed to identify how network-level covariates -- such as architecture, anatomical brain region, and model organism -- impact neural representations (hidden layer activations). Here, we provide a rigorous foundation for these analyses by defining a broad family of metric spaces that quantify representational dissimilarity. Using this framework we modify existing representational similarity measures based on canonical correlation analysis to satisfy the triangle inequality, formulate a novel metric that respects the inductive biases in convolutional layers, and identify approximate Euclidean embeddings that enable network representations to be incorporated into essentially any off-the-shelf machine learning method. We demonstrate these methods on large-scale datasets from biology (Allen Institute Brain Observatory) and deep learning (NAS-Bench-101). In doing so, we identify relationships between neural representations that are interpretable in terms of anatomical features and model performance.
    Symmetric Sparse Boolean Matrix Factorization and Applications. (arXiv:2102.01570v3 [cs.LG] UPDATED)
    (0 min) In this work, we study a variant of nonnegative matrix factorization where we wish to find a symmetric factorization of a given input matrix into a sparse, Boolean matrix. Formally speaking, given $\mathbf{M}\in\mathbb{Z}^{m\times m}$, we want to find $\mathbf{W}\in\{0,1\}^{m\times r}$ such that $\| \mathbf{M} - \mathbf{W}\mathbf{W}^\top \|_0$ is minimized among all $\mathbf{W}$ for which each row is $k$-sparse. This question turns out to be closely related to a number of questions like recovering a hypergraph from its line graph, as well as reconstruction attacks for private neural network training. As this problem is hard in the worst-case, we study a natural average-case variant that arises in the context of these reconstruction attacks: $\mathbf{M} = \mathbf{W}\mathbf{W}^{\top}$ for $\mathbf{W}$ a random Boolean matrix with $k$-sparse rows, and the goal is to recover $\mathbf{W}$ up to column permutation. Equivalently, this can be thought of as recovering a uniformly random $k$-uniform hypergraph from its line graph. Our main result is a polynomial-time algorithm for this problem based on bootstrapping higher-order information about $\mathbf{W}$ and then decomposing an appropriate tensor. The key ingredient in our analysis, which may be of independent interest, is to show that such a matrix $\mathbf{W}$ has full column rank with high probability as soon as $m = \widetilde{\Omega}(r)$, which we do using tools from Littlewood-Offord theory and estimates for binary Krawtchouk polynomials.
    Real-Time GPU-Accelerated Machine Learning Based Multiuser Detection for 5G and Beyond. (arXiv:2201.05024v1 [eess.SP])
    (2 min) Adaptive partial linear beamforming meets the need of 5G and future 6G applications for high flexibility and adaptability. Choosing an appropriate tradeoff between conflicting goals opens the recently proposed multiuser (MU) detection method. Due to their high spatial resolution, nonlinear beamforming filters can significantly outperform linear approaches in stationary scenarios with massive connectivity. However, a dramatic decrease in performance can be expected in high mobility scenarios because they are very susceptible to changes in the wireless channel. The robustness of linear filters is required, considering these changes. One way to respond appropriately is to use online machine learning algorithms. The theory of algorithms based on the adaptive projected subgradient method (APSM) is rich, and they promise accurate tracking capabilities in dynamic wireless environments. However, one of the main challenges comes from the real-time implementation of these algorithms, which involve projections on time-varying closed convex sets. While the projection operations are relatively simple, their vast number poses a challenge in ultralow latency (ULL) applications where latency constraints must be satisfied in every radio frame. Taking non-orthogonal multiple access (NOMA) systems as an example, this paper explores the acceleration of APSM-based algorithms through massive parallelization. The result is a GPU-accelerated real-time implementation of an orthogonal frequency-division multiplexing (OFDM)-based transceiver that enables detection latency of less than one millisecond and therefore complies with the requirements of 5G and beyond. To meet the stringent physical layer latency requirements, careful co-design of hardware and software is essential, especially in virtualized wireless systems with hardware accelerators.
    Metric-Free Individual Fairness in Online Learning. (arXiv:2002.05474v5 [cs.LG] UPDATED)
    (0 min) We study an online learning problem subject to the constraint of individual fairness, which requires that similar individuals are treated similarly. Unlike prior work on individual fairness, we do not assume the similarity measure among individuals is known, nor do we assume that such measure takes a certain parametric form. Instead, we leverage the existence of an auditor who detects fairness violations without enunciating the quantitative measure. In each round, the auditor examines the learner's decisions and attempts to identify a pair of individuals that are treated unfairly by the learner. We provide a general reduction framework that reduces online classification in our model to standard online classification, which allows us to leverage existing online learning algorithms to achieve sub-linear regret and number of fairness violations. Surprisingly, in the stochastic setting where the data are drawn independently from a distribution, we are also able to establish PAC-style fairness and accuracy generalization guarantees (Yona and Rothblum [2018]), despite only having access to a very restricted form of fairness feedback. Our fairness generalization bound qualitatively matches the uniform convergence bound of Yona and Rothblum [2018], while also providing a meaningful accuracy generalization guarantee. Our results resolve an open question by Gillen et al. [2018] by showing that online learning under an unknown individual fairness constraint is possible even without assuming a strong parametric form of the underlying similarity measure.
    Combining Interventional and Observational Data Using Causal Reductions. (arXiv:2103.04786v2 [stat.ML] UPDATED)
    (2 min) Unobserved confounding is one of the main challenges when estimating causal effects. We propose a causal reduction method that, given a causal model, replaces an arbitrary number of possibly high-dimensional latent confounders with a single latent confounder that takes values in the same space as the treatment variable, without changing the observational and interventional distributions the causal model entails. This allows us to estimate the causal effect in a principled way from combined data without relying on the common but often unrealistic assumption that all confounders have been observed. We apply our causal reduction in three different settings. In the first setting, we assume the treatment and outcome to be discrete. The causal reduction then implies bounds between the observational and interventional distributions that can be exploited for estimation purposes. In certain cases with highly unbalanced observational samples, the accuracy of the causal effect estimate can be improved by incorporating observational data. Second, for continuous variables and assuming a linear-Gaussian model, we derive equality constraints for the parameters of the observational and interventional distributions. Third, for the general continuous setting (possibly nonlinear or non-Gaussian), we parameterize the reduced causal model using normalizing flows, a flexible class of easily invertible nonlinear transformations. We perform a series of experiments on synthetic data and find that in several cases the number of interventional samples can be reduced when adding observational training samples without sacrificing accuracy.
    Fractal Gaussian Networks: A sparse random graph model based on Gaussian Multiplicative Chaos. (arXiv:2008.03038v2 [stat.ML] UPDATED)
    (2 min) We propose a novel stochastic network model, called Fractal Gaussian Network (FGN), that embodies well-defined and analytically tractable fractal structures. Such fractal structures have been empirically observed in diverse applications. FGNs interpolate continuously between the popular purely random geometric graphs (a.k.a. the Poisson Boolean network), and random graphs with increasingly fractal behavior. In fact, they form a parametric family of sparse random geometric graphs that are parametrized by a fractality parameter which governs the strength of the fractal structure. FGNs are driven by the latent spatial geometry of Gaussian Multiplicative Chaos (GMC), a canonical model of fractality in its own right. We asymptotically characterize the expected number of edges, triangles, cliques and hub-and-spoke motifs in FGNs, unveiling a distinct pattern in their scaling with the size parameter of the network. We then examine the natural question of detecting the presence of fractality and the problem of parameter estimation based on observed network data, in addition to fundamental properties of the FGN as a random graph model. We also explore fractality in community structures by unveiling a natural stochastic block model in the setting of FGNs. Finally, we substantiate our results with phenomenological analysis of the FGN in the context of available scientific literature for fractality in networks, including applications to real-world massive network data.
  • Data Science Central

    A Framework for Understanding Transformation–Part I
    (4 min) Introduction In Agile Enterprise Risk Management, Risk-Based Thinking, Multi-Disciplinary Management and Digital Transformation I focus on several topics tied to transformation.  The overarching objective of transformation is business agility—the ability to identify opportunities and threats in your environment and react to them at speed.  In the volatile environment in which we are now living, this… Read More »A Framework for Understanding Transformation–Part I The post A Framework for Understanding Transformation–Part I appeared first on Data Science Central.
  • John D. Cook

    Multiserver queues with general probability distributions
    (2 min) Textbook presentations of queueing theory start by assuming that the time between customer arrivals and the time to serve a customer are both exponentially distributed. Maybe these assumptions are realistic enough, but maybe not. For example, maybe there’s a cutoff on service time. If someone takes too long, you tell them there’s no more you […] Multiserver queues with general probability distributions first appeared on John D. Cook.
    Queueing and Economies of Scale
    (2 min) If a single server is being pushed to the limit, adding a second server can drastically reduce wait times. This is true whether the server is a human serving food or a computer serving web pages. I first wrote about this here and I give more details here. What if you add an extra server […] Queueing and Economies of Scale first appeared on John D. Cook.
    Linear logic arithmetic
    (1 min) Linear logic has connectives not used in classical logic. The connectives ⊗and & are conjunctions, ⊕ and ⅋ are disjunctions, and ! and ? are analogous to the modal operators ◻ and ◇ (necessity and possibility). Another way to classify the connectives is to say ⊕ and & are called additive, ⊗ and ⅋ are […] Linear logic arithmetic first appeared on John D. Cook.
  • Artificial Intelligence

    Trained GPT-2 to write poems
    (2 min) BEHOLD! I'm presenting you... A terrible mix of Philips Sydney, William Blake and Emily Dickinson. Enjoy reading those terrible poems of GPT-2 o_o THE ECHOING GREEN My mistress calls me When I am in the sky, And I am just, and bright. She calls me when I am in the sky, And she sings My joys, my sorrows, To my songs you sweet child can hear. Oh! how sweet is the sound Of God! As he smiles his child appears, And smiles on me, and out of me shine. I previously shared the next one, but it feels nice to read. That's one of the best I guess. A MAN'S DIALOGUE The man's life revolves round a circle; There, in the middle, the circle of life Reveals itself, all its life, Except and within itself, The whole circle is alive. The AI... Oh Boy... What have you done GPT-2? Sweet is his wife, if sh…
    I built an AI bot that draws people’s dream jobs on Twitter
    (1 min) submitted by /u/maaartiin_mac [link] [comments]
    A Science Journalist’s Journey to Understand AI
    (0 min) submitted by /u/regalalgorithm [link] [comments]
    Meta AI Introduces AV-HuBERT: A State-Of-The-Art Self-Supervised Framework For Understanding Speech That Learns By Both Seeing And Hearing People Speak
    (1 min) AI is used for various speech recognition and understanding activities, ranging from enabling smart speakers to designing aids for persons who are deaf or have speech impairments. However, these speech comprehension algorithms frequently fail to perform well in everyday scenarios where we need them the most: when numerous people speak simultaneously or when there is a lot of background noise. Even advanced noise-canceling techniques are frequently ineffective against, for instance, the sound of the ocean on a beach trip or the background chatter of a noisy street market. Humans interpret speech better than AI in these situations because we use both our ears and our sight. For example, we could watch someone’s mouth move and intuitively know the voice we’re hearing is coming from her. That is why Meta AI is developing new conversational AI systems that, like us, can discern intricate relationships between what they see and what they hear in conversation. Continue Reading Paper 1: https://arxiv.org/abs/2201.01763? Paper 2: https://arxiv.org/abs/2201.02184? Github: https://facebookresearch.github.io/av_hubert/ ​ https://i.redd.it/zz4piiyzgwb81.gif submitted by /u/techsucker [link] [comments]
    Remove Unwanted Objects From High-Quality Images! (not only 256x256...!). LaMa explained
    (1 min) submitted by /u/OnlyProggingForFun [link] [comments]
    Need a book about artificial intelligence/neural network/soft computing written by an Indian author. If you have pdf please share!
    (1 min) Any book will suffice for me. But the books that I am looking for is this-: https://www.amazon.com/Introduction-Neural-Networks-Matlab-6-0/dp/0070591121 https://b-ok.asia/book/11849928/95de3e There is 1 book in b-ok.org but some guy has just posted the google books preview pdf of this book. lol. If you studied artificial intelligence/soft computing/neural networks in university can you share your book you used to pass university exam questions? I am from Nepal and our syllabus is copied from those Indian university colleges. So having their book would be helpful. But the thing is that Indian author's textbook are not available in Nepal lol. submitted by /u/barnyard9 [link] [comments]
    AI based diet and nutrition planner?
    (0 min) Any app or website that has a mature AI engine to recommend diet and nutrition plan? Or an open source engine that does it when trained. submitted by /u/purelibran [link] [comments]
  • Machine Learning

    [P] A tutorial on evolving network parameters with JAX
    (1 min) Hi there. I've been interested in evolutionary strategies for a while, and OpenAI's paper on them shows that they can be a pretty interesting alternative to RL. Now that I've worked with JAX a bit, I decided to put together a tutorial showcasing how I used its vectorization and parallelization to implement evolution pretty easily and get good performance on GPU/TPU devices. https://github.com/ttt733/nn-evolution-jax/blob/main/tutorial.md This isn't a tutorial on achieving results on an interesting task, but I just wanted to write about an optimization technique I think is somewhat overlooked. The focus is on walking through how I used JAX's functions to implement it. I don't go into the math behind evolutionary strategies because there are already lots of articles on particular strategies (like CMA-ES) and their performance, how they work, etc. I just use the simplistic strategies OpenAI called out in their post and used in their MuJoCo-solving code - you're free to experiment with more complex ones. And, if you know JAX better than I do, please feel free to correct me on anything I got wrong. (I know some of the vectorization stuff can get weird, so there's a chance that I vmap'd a little too aggressively.) Hope you enjoy! submitted by /u/kulili [link] [comments]
    [P] I made an AI twitter bot that draws people’s dream jobs for them.
    (2 min) submitted by /u/maaartiin_mac [link] [comments]
    [R] MatrixCalculus.org -- A tool for automatically computing Matrix Derivatives
    (1 min) I came across this site, from Andrew Gelman's blog (https://statmodeling.stat.columbia.edu/2020/06/03/online-matrix-derivative-calculator-vs-matrix-normal-stan/): http://www.matrixcalculus.org/ It'll do matrix calculus for you, and not only provide you the result, you can export it into either LaTeX or produce Python code! It's super useful for when you have to precompute things by "hand." There's an associated NeurIPS paper if you want the gory details: http://www.matrixcalculus.org/matrixcalculus.pdf submitted by /u/bikeskata [link] [comments]
    [D]Are there any effective machine learning methods that aren't copied from nature?
    (2 min) Recently have been struck by the fact that the two most powerful machine learning methods, neural networks and genetic algorithms, are partly just copied from nature (in concept at least, obviously was a ton of work by a lot of brilliant people). I guess there is a lot of machine learning that is basically just memorization or math but are there any other deep learning algorithm paradigms out there? Do you think there could be someday? submitted by /u/naxpouse [link] [comments]
    [D] Is there an open-source implementation of the Retrieval-Enhanced Transformer (RETRO)?
    (1 min) Wondering if there's an open-source implementation of the Retrieval-Enhanced Transformer (RETRO) [arxiv link]? I imagine that it may take some time to make its way to production libraries like Huggingface Transformers due to the external memory component. Anyone heard of people working on this? submitted by /u/bigvenn [link] [comments]
    [R] Detecting Twenty-thousand Classes using Image-level Supervision
    (0 min) submitted by /u/Illustrious_Row_9971 [link] [comments]
    [P] Built a dog poop detector for my backyard
    (2 min) Over winter break I started poking around online for ways to track dog poop in my backyard. I don't like having to walk around and hope I picked up all of it. Where I live it snows a lot, and poops get lost in the snow come new snowfall. I found some cool concept gadgets that people have made, but nothing that worked with just a security cam. So I built this poop detector and made a video about it. When some code I wrote detects my dog pooping it will remember the location and draw a circle where my dog pooped on a picture of my backyard. So over the course of a couple of months I have a bunch of circle on a picture of my backyard, where all my dog's poops are. So this coming spring I will know where to look! Check out the video if you care: https://www.youtube.com/watch?v=uWZu3rnj-kQ Figured I would share here, it was fun to work on. Is this something you would hook up to a security camera if it was simple? Curious. Also, check out DeepLabCut. My project wouldn't have been possible without it, and it's really cool: https://github.com/DeepLabCut/DeepLabCut submitted by /u/GoochCommander [link] [comments]
    [R] Deep Symbolic Regression for Recurrent Sequences
    (0 min) submitted by /u/hardmaru [link] [comments]
    [P] Simple Tensorflow Implementation of "Toward Spatially Unbiased Generative Models (ICCV 2021)"
    (1 min) ​ samples.gif Abstract Recent image generation models show remarkable generation performance. However, they mirror strong location preference in datasets, which we call spatial bias. Therefore, generators render poor samples at unseen locations and scales. We argue that the generators rely on their implicit positional encoding to render spatial content. From our observations, the generator’s implicit positional encoding is translation-variant, making the generator spatially biased. To address this issue, we propose injecting explicit positional encoding at each scale of the generator. By learning the spatially unbiased generator, we facilitate the robust use of generators in multiple tasks, such as GAN inversion, multi-scale generation, generation of arbitrary sizes and aspect ratios. Furthermore, we show that our method can also be applied to denoising diffusion probabilistic models. code : https://github.com/taki0112/Toward_spatial_unbiased-Tensorflow paper : https://arxiv.org/abs/2108.01285 submitted by /u/taki0112 [link] [comments]
  • Neural Networks, Deep Learning and Machine Learning

    Need a book about artificial intelligence/neural network/soft computing written by an Indian author. If you have pdf please share!
    (1 min) Any book will suffice for me. ​ But the books that I am looking for is this-: ​ [https://www.amazon.com/Introduction-Neural-Networks-Matlab-6-0/dp/0070591121](https://www.amazon.com/Introduction-Neural-Networks-Matlab-6-0/dp/0070591121) ​ [https://b-ok.asia/book/11849928/95de3e](https://b-ok.asia/book/11849928/95de3e) ​ There is 1 book in [b-ok.org](https://b-ok.org) but some guy has just posted the google books preview pdf of this book. ​ lol. ​ **If you studied artificial intelligence/soft computing/neural networks in university can you share your book you used to pass university exam questions? I am from Nepal and our syllabus is copied from those Indian university colleges. So having their book would be helpful. But the thing is that Indian author's textbook are not available in Nepal lol.** submitted by /u/barnyard9 [link] [comments]
  • Reinforcement Learning

    Seeking advice on my custom env (demo included)
    (1 min) Hi ! I m looking for some advices. I created a custom gym environement and ran it with stable baseline's PPO. My environement is rather simple I believe. Its a 2d Space with two dots in it. One dot is the agent ( blue dot) Another dot (red) is freely travelling vertically or horizontally (randomly selected at env init). The goal is to catch the red dot. The episode is considered done when the distance between both dot is under a small value. The observation space is continuous (x / y between 0 and 1) The action space is discrete ( up down left right) demo The agent is rather good at "chasing" but not good at "planning". For example it could position itself on the inbound trajectory of the red dot to "intercept" it but it fails to do so. Any idea how to get to that behavior ? Code https://github.com/Ouassimf/rl-tracker-2d Thanks for advices ! submitted by /u/Ouassimf [link] [comments]
    Expected returns don't match with actions, because the agent ouputs a 4x1 vector action space at each time step
    (1 min) I am working with an environment with a continuous action space. The agent needs to output 4 continuous values for the action space. I'm using REINFORCE, so I built a small net that, given a (31,)-sized observation space, outputs a 4x1 vector (https://www.pettingzoo.ml/sisl/multiwalker). My loss is: loss= -torch.sum(torch.log(prob_batch)*expected_returns_batch) So it requires an array of action probabilities for the actions that were taken and the discounted rewards. For this reason, I recomputes the action probabilities for all the states in the trajectory and subsets the action-probabilities associated with the actions that were actually taken with the following two lines of code: pred_batch = model(state_batch) prob_batch = pred_batch.gather(dim=1,index=action_batch .long().view(-1,1)).squeeze() However, I receive this error: RuntimeError: Size does not match at dimension 0 expected index [1116, 1] to be smaller than self [279, 4] apart from dimension 1 So the problem seems to be that the agent emits four actions at each time step, but the code expects one action only (1116 is indeed four times bigger than 279). How do I fix this? Thanks! submitted by /u/No_Possibility_7588 [link] [comments]
    Outputting a 4x1 continuous action space according to a probability distribution
    (1 min) Hi all, I am working with an environment with a continuous action space. The agent needs to output 4 continuous values for the action space. I'm using REINFORCE, so I built a small net that, given a (31,)-sized observation space, outputs a 4x1 vector (https://www.pettingzoo.ml/sisl/multiwalker). The reason why I'm getting confused is that, if the possible actions were 4 and the agent had to select 1, I could use that probability distribution to choose the one with the highest value. However, here the agent needs to output a 4x1 action space, so I'm not sure how to use the probability distribution. Thanks! submitted by /u/No_Possibility_7588 [link] [comments]
    Action space for multi units controlled by a single agent
    (1 min) Hi everyone, I have a gym environment where there are multiple units controlled by a single agent. These units can also create new units and the units may also die. Since the number of units may vary, I am wondering how to make an action space if my Agent have to take actions for each units in a single step. submitted by /u/Coder_Unknown [link] [comments]
    Can Rainbow learn a stochastic policy?
    (1 min) Hi, so I'm developing an RL algorithm, and I have a problem what to choosing Rainbow or PPO. According to Sutton and Barto's book, the DQN fails to learn a stochastic policy i.e. while in the state or observation 's' acting optimally would be taking action a1 with 60% probability and taking action b with 40 probability. Now since the DQN outputs always the optimal action (based on its value) and not the distribution of action probabilities I see that DQN can't learn optimal stochastic policy? I'm unfamiliar with Rainbow implementation, but does it solve that problem of DQN learning deterministic policy? submitted by /u/basic_r_user [link] [comments]
    PPO BipedalWalker-v3 not converging enough
    (2 min) I am training BipedalWalker-v3 agent with PPO algorithm and I cant get my agent to get reward 300 or higher. I implemented basic PPO with several upgrades: GAE Normalized advantage Value Function Loss Clipping Overall Loss Includes Entropy Loss Adam Learning Rate Annealing Mini-batch Updates The Epsilon Parameter of Adam Optimizer set to 1e-5 Normalize and clip observation Scale and clip reward One NN for policy, one for critic Results I get are shown in table below where Y axis is reward, X is one step (2048 steps, 10 updates). https://preview.redd.it/kf7jpy9qfub81.png?width=1710&format=png&auto=webp&s=784e49af512ae2e8534152461ff251de701dee48 I get really close, but never to over 300. Let alone average 300 reward over 100 episodes which is the goal. I tried to tune my hyperparameters but wasn't successful. I can provide other graphs, like policy loss, critic loss or mean 100 reward if that helps, but I need help to read these graphs so I can change my hyperparamaters or maybe implementation. Best i managed was reward of 300.1 in one step (and never again) and mean reward of 282 when test process tested NNs for 100 episodes. submitted by /u/TheGuy839 [link] [comments]
    Any tips for training policy gradient on solving mazes?
    (1 min) submitted by /u/Mithrandir2k16 [link] [comments]
    Question about Uber's POET (specifically, which version of BipedalWalker they used)
    (1 min) I'm wondering if their code used V2 or V3. I will ask them if no one here knows, but I thought I would be faster to ask here. Apparently V2 had a pretty terrible bug with the LIDARs, so I don't want to reuse POET's code before making sure they used the right version Thanks! submitted by /u/DuchessOfNull [link] [comments]
    Definition of an Episodic Task
    (1 min) Hi, Just wondering if anyone has come across a formal definition for episodic tasks in reinforcement learning? Sutton's (excellent) book talks about episodic tasks but seems to not formally define them. The questions I'm trying to answer are: (1) does episodic imply episodes are time-limited to some (known) length H (one author uses this)? (2) Does it just imply termination in a finite (but arbitarly large) number of steps? (3) Does it imply neither of these? My thinking is that ergodic + at least one terminal state + a finite number of states implies 2 (but an infinite number of states does not). Thanks submitted by /u/VirtualHat [link] [comments]

2022-01-14

  • cs.LG updates on arXiv.org

    Eigenvalue Distribution of Large Random Matrices Arising in Deep Neural Networks: Orthogonal Case. (arXiv:2201.04543v1 [stat.ML])
    (2 min) The paper deals with the distribution of singular values of the input-output Jacobian of deep untrained neural networks in the limit of their infinite width. The Jacobian is the product of random matrices where the independent rectangular weight matrices alternate with diagonal matrices whose entries depend on the corresponding column of the nearest neighbor weight matrix. The problem was considered in \cite{Pe-Co:18} for the Gaussian weights and biases and also for the weights that are Haar distributed orthogonal matrices and Gaussian biases. Basing on a free probability argument, it was claimed that in these cases the singular value distribution of the Jacobian in the limit of infinite width (matrix size) coincides with that of the analog of the Jacobian with special random but weight independent diagonal matrices, the case well known in random matrix theory. The claim was rigorously proved in \cite{Pa-Sl:21} for a quite general class of weights and biases with i.i.d. (including Gaussian) entries by using a version of the techniques of random matrix theory. In this paper we use another version of the techniques to justify the claim for random Haar distributed weight matrices and Gaussian biases.
    Zero-shot Audio Source Separation through Query-based Learning from Weakly-labeled Data. (arXiv:2112.07891v3 [cs.SD] UPDATED)
    (2 min) Deep learning techniques for separating audio into different sound sources face several challenges. Standard architectures require training separate models for different types of audio sources. Although some universal separators employ a single model to target multiple sources, they have difficulty generalizing to unseen sources. In this paper, we propose a three-component pipeline to train a universal audio source separator from a large, but weakly-labeled dataset: AudioSet. First, we propose a transformer-based sound event detection system for processing weakly-labeled training data. Second, we devise a query-based audio separation model that leverages this data for model training. Third, we design a latent embedding processor to encode queries that specify audio targets for separation, allowing for zero-shot generalization. Our approach uses a single model for source separation of multiple sound types, and relies solely on weakly-labeled data for training. In addition, the proposed audio separator can be used in a zero-shot setting, learning to separate types of audio sources that were never seen in training. To evaluate the separation performance, we test our model on MUSDB18, while training on the disjoint AudioSet. We further verify the zero-shot performance by conducting another experiment on audio source types that are held-out from training. The model achieves comparable Source-to-Distortion Ratio (SDR) performance to current supervised models in both cases.
    One-Step Abductive Multi-Target Learning with Diverse Noisy Samples: An Application to Tumour Segmentation for Breast Cancer. (arXiv:2110.10325v2 [cs.LG] UPDATED)
    (0 min) One-step abductive multi-target learning (OSAMTL) is an approach proposed to handle complex noisy labels. However, OSAMTL is not suitable for the situation where diverse noisy samples (DNS) are provided for a learning task. In this paper, giving definition of DNS, we propose one-step abductive multi-target learning with DNS (OSAMTL-DNS) to expand the original OSAMTL to a wider range of tasks that handle complex noisy labels. Applying OSAMTL-DNS to tumour segmentation for breast cancer in medical histopathology whole slide image analysis, we show that OSAMTL-DNS is able to enable various state-of-the-art approaches for learning from noisy labels to achieve significantly more rational predictions.
    Smoothness and continuity of cost functionals for ECG mismatch computation. (arXiv:2201.04487v1 [physics.med-ph])
    (0 min) The field of cardiac electrophysiology tries to abstract, describe and finally model the electrical characteristics of a heartbeat. With recent advances in cardiac electrophysiology, models have become more powerful and descriptive as ever. However, to advance to the field of inverse electrophysiological modeling, i.e. creating models from electrical measurements such as the ECG, the less investigated field of smoothness of the simulated ECGs w.r.t. model parameters need to be further explored. The present paper discusses smoothness in terms of the whole pipeline which describes how from physiological parameters, we arrive at the simulated ECG. Employing such a pipeline, we create a test-bench of a simplified idealized left ventricle model and demonstrate the most important factors for efficient inverse modeling through smooth cost functionals. Such knowledge will be important for designing and creating inverse models in future optimization and machine learning methods.
    Manifold learning via quantum dynamics. (arXiv:2112.11161v2 [quant-ph] UPDATED)
    (0 min) We introduce an algorithm for computing geodesics on sampled manifolds that relies on simulation of quantum dynamics on a graph embedding of the sampled data. Our approach exploits classic results in semiclassical analysis and the quantum-classical correspondence, and forms a basis for techniques to learn the manifold from which a dataset is sampled, and subsequently for nonlinear dimensionality reduction of high-dimensional datasets. We illustrate the new algorithm with data sampled from model manifolds and also by a clustering demonstration based on COVID-19 mobility data. Finally, our method reveals interesting connections between the discretization provided by data sampling and quantization.
    A Feature Extraction based Model for Hate Speech Identification. (arXiv:2201.04227v1 [cs.CL])
    (2 min) The detection of hate speech online has become an important task, as offensive language such as hurtful, obscene and insulting content can harm marginalized people or groups. This paper presents TU Berlin team experiments and results on the task 1A and 1B of the shared task on hate speech and offensive content identification in Indo-European languages 2021. The success of different Natural Language Processing models is evaluated for the respective subtasks throughout the competition. We tested different models based on recurrent neural networks in word and character levels and transfer learning approaches based on Bert on the provided dataset by the competition. Among the tested models that have been used for the experiments, the transfer learning-based models achieved the best results in both subtasks.
    Blackbox Post-Processing for Multiclass Fairness. (arXiv:2201.04461v1 [cs.LG])
    (2 min) Applying standard machine learning approaches for classification can produce unequal results across different demographic groups. When then used in real-world settings, these inequities can have negative societal impacts. This has motivated the development of various approaches to fair classification with machine learning models in recent years. In this paper, we consider the problem of modifying the predictions of a blackbox machine learning classifier in order to achieve fairness in a multiclass setting. To accomplish this, we extend the 'post-processing' approach in Hardt et al. 2016, which focuses on fairness for binary classification, to the setting of fair multiclass classification. We explore when our approach produces both fair and accurate predictions through systematic synthetic experiments and also evaluate discrimination-fairness tradeoffs on several publicly available real-world application datasets. We find that overall, our approach produces minor drops in accuracy and enforces fairness when the number of individuals in the dataset is high relative to the number of classes and protected groups.
    Flexible, Non-parametric Modeling Using Regularized Neural Networks. (arXiv:2012.11369v3 [cs.LG] UPDATED)
    (2 min) Non-parametric, additive models are able to capture complex data dependencies in a flexible, yet interpretable way. However, choosing the format of the additive components often requires non-trivial data exploration. Here, as an alternative, we propose PrAda-net, a one-hidden-layer neural network, trained with proximal gradient descent and adaptive lasso. PrAda-net automatically adjusts the size and architecture of the neural network to reflect the complexity and structure of the data. The compact network obtained by PrAda-net can be translated to additive model components, making it suitable for non-parametric statistical modelling with automatic model selection. We demonstrate PrAda-net on simulated data, where wecompare the test error performance, variable importance and variable subset identification properties of PrAda-net to other lasso-based regularization approaches for neural networks. We also apply PrAda-net to the massive U.K. black smoke data set, to demonstrate how PrAda-net can be used to model complex and heterogeneous data with spatial and temporal components. In contrast to classical, statistical non-parametric approaches, PrAda-net requires no preliminary modeling to select the functional forms of the additive components, yet still results in an interpretable model representation.
    SLISEMAP: Explainable Dimensionality Reduction. (arXiv:2201.04455v1 [cs.LG])
    (0 min) Existing explanation methods for black-box supervised learning models generally work by building local models that explain the models behaviour for a particular data item. It is possible to make global explanations, but the explanations may have low fidelity for complex models. Most of the prior work on explainable models has been focused on classification problems, with less attention on regression. We propose a new manifold visualization method, SLISEMAP, that at the same time finds local explanations for all of the data items and builds a two-dimensional visualization of model space such that the data items explained by the same model are projected nearby. We provide an open source implementation of our methods, implemented by using GPU-optimized PyTorch library. SLISEMAP works both on classification and regression models. We compare SLISEMAP to most popular dimensionality reduction methods and some local explanation methods. We provide mathematical derivation of our problem and show that SLISEMAP provides fast and stable visualizations that can be used to explain and understand black box regression and classification models.
    On generalization bounds for deep networks based on loss surface implicit regularization. (arXiv:2201.04545v1 [stat.ML])
    (0 min) The classical statistical learning theory says that fitting too many parameters leads to overfitting and poor performance. That modern deep neural networks generalize well despite a large number of parameters contradicts this finding and constitutes a major unsolved problem towards explaining the success of deep learning. The implicit regularization induced by stochastic gradient descent (SGD) has been regarded to be important, but its specific principle is still unknown. In this work, we study how the local geometry of the energy landscape around local minima affects the statistical properties of SGD with Gaussian gradient noise. We argue that under reasonable assumptions, the local geometry forces SGD to stay close to a low dimensional subspace and that this induces implicit regularization and results in tighter bounds on the generalization error for deep neural networks. To derive generalization error bounds for neural networks, we first introduce a notion of stagnation sets around the local minima and impose a local essential convexity property of the population risk. Under these conditions, lower bounds for SGD to remain in these stagnation sets are derived. If stagnation occurs, we derive a bound on the generalization error of deep neural networks involving the spectral norms of the weight matrices but not the number of network parameters. Technically, our proofs are based on controlling the change of parameter values in the SGD iterates and local uniform convergence of the empirical loss functions based on the entropy of suitable neighborhoods around local minima. Our work attempts to better connect non-convex optimization and generalization analysis with uniform convergence.
    Test for non-negligible adverse shifts. (arXiv:2107.02990v3 [stat.ML] UPDATED)
    (2 min) Statistical tests for dataset shift are susceptible to false alarms: they are sensitive to minor differences when there is in fact adequate sample coverage and predictive performance. We propose instead a framework to detect adverse dataset shifts based on outlier scores, $\texttt{D-SOS}$ for short. $\texttt{D-SOS}$ holds that the new (test) sample is not substantively worse than the reference (training) sample, and not that the two are equal. The key idea is to reduce observations to outlier scores and compare contamination rates at varying weighted thresholds. Users can define what $\it{worse}$ means in terms of relevant notions of outlyingness, including proxies for predictive performance. Compared to tests of equal distribution, our approach is uniquely tailored to serve as a robust metric for model monitoring and data validation. We show how versatile and practical $\texttt{D-SOS}$ is on a wide range of real and simulated data.
    Extensions to the Proximal Distance Method of Constrained Optimization. (arXiv:2009.00801v2 [math.OC] UPDATED)
    (2 min) The current paper studies the problem of minimizing a loss $f(\boldsymbol{x})$ subject to constraints of the form $\boldsymbol{D}\boldsymbol{x} \in S$, where $S$ is a closed set, convex or not, and $\boldsymbol{D}$ is a matrix that fuses parameters. Fusion constraints can capture smoothness, sparsity, or more general constraint patterns. To tackle this generic class of problems, we combine the Beltrami-Courant penalty method with the proximal distance principle. The latter is driven by minimization of penalized objectives $f(\boldsymbol{x})+\frac{\rho}{2}\text{dist}(\boldsymbol{D}\boldsymbol{x},S)^2$ involving large tuning constants $\rho$ and the squared Euclidean distance of $\boldsymbol{D}\boldsymbol{x}$ from $S$. The next iterate $\boldsymbol{x}_{n+1}$ of the corresponding proximal distance algorithm is constructed from the current iterate $\boldsymbol{x}_n$ by minimizing the majorizing surrogate function $f(\boldsymbol{x})+\frac{\rho}{2}\|\boldsymbol{D}\boldsymbol{x}-\mathcal{P}_{S}(\boldsymbol{D}\boldsymbol{x}_n)\|^2$. For fixed $\rho$ and a subanalytic loss $f(\boldsymbol{x})$ and a subanalytic constraint set $S$, we prove convergence to a stationary point. Under stronger assumptions, we provide convergence rates and demonstrate linear local convergence. We also construct a steepest descent (SD) variant to avoid costly linear system solves. To benchmark our algorithms, we compare against the alternating direction method of multipliers (ADMM). Our extensive numerical tests include problems on metric projection, convex regression, convex clustering, total variation image denoising, and projection of a matrix to a good condition number. These experiments demonstrate the superior speed and acceptable accuracy of our steepest variant on high-dimensional problems.
    Detecting Ransomware Execution in a Timely Manner. (arXiv:2201.04424v1 [cs.CR])
    (2 min) Ransomware has been an ongoing issue since the early 1990s. In recent times ransomware has spread from traditional computational resources to cyber-physical systems and industrial controls. We devised a series of experiments in which virtual instances are infected with ransomware. We instrumented the instances and collected resource utilization data across a variety of metrics (CPU, Memory, Disk Utility). We design a change point detection and learning method for identifying ransomware execution. Finally we evaluate and demonstrate its ability to detect ransomware efficiently in a timely manner when trained on a minimal set of samples. Our results represent a step forward for defense, and we conclude with further remarks for the path forward.
    Predicting Alzheimer's Disease Using 3DMgNet. (arXiv:2201.04370v1 [eess.IV])
    (2 min) Alzheimer's disease (AD) is an irreversible neurode generative disease of the brain.The disease may causes memory loss, difficulty communicating and disorientation. For the diagnosis of Alzheimer's disease, a series of scales are often needed to evaluate the diagnosis clinically, which not only increases the workload of doctors, but also makes the results of diagnosis highly subjective. Therefore, for Alzheimer's disease, imaging means to find early diagnostic markers has become a top priority. In this paper, we propose a novel 3DMgNet architecture which is a unified framework of multigrid and convolutional neural network to diagnose Alzheimer's disease (AD). The model is trained using an open dataset (ADNI dataset) and then test with a smaller dataset of ours. Finally, the model achieved 92.133% accuracy for AD vs NC classification and significantly reduced the model parameters.
    Online Continual Learning under Extreme Memory Constraints. (arXiv:2008.01510v3 [cs.CV] UPDATED)
    (2 min) Continual Learning (CL) aims to develop agents emulating the human ability to sequentially learn new tasks while being able to retain knowledge obtained from past experiences. In this paper, we introduce the novel problem of Memory-Constrained Online Continual Learning (MC-OCL) which imposes strict constraints on the memory overhead that a possible algorithm can use to avoid catastrophic forgetting. As most, if not all, previous CL methods violate these constraints, we propose an algorithmic solution to MC-OCL: Batch-level Distillation (BLD), a regularization-based CL approach, which effectively balances stability and plasticity in order to learn from data streams, while preserving the ability to solve old tasks through distillation. Our extensive experimental evaluation, conducted on three publicly available benchmarks, empirically demonstrates that our approach successfully addresses the MC-OCL problem and achieves comparable accuracy to prior distillation methods requiring higher memory overhead.
    Careful! Training Relevance is Real. (arXiv:2201.04429v1 [cs.LG])
    (2 min) There is a recent proliferation of research on the integration of machine learning and optimization. One expansive area within this research stream is predictive-model embedded optimization, which uses pre-trained predictive models for the objective function of an optimization problem, so that features of the predictive models become decision variables in the optimization problem. Despite a recent surge in publications in this area, one aspect of this decision-making pipeline that has been largely overlooked is training relevance, i.e., ensuring that solutions to the optimization problem should be similar to the data used to train the predictive models. In this paper, we propose constraints designed to enforce training relevance, and show through a collection of experimental results that adding the suggested constraints significantly improves the quality of solutions obtained.
    HyperTransformer: Model Generation for Supervised and Semi-Supervised Few-Shot Learning. (arXiv:2201.04182v1 [cs.LG])
    (2 min) In this work we propose a HyperTransformer, a transformer-based model for few-shot learning that generates weights of a convolutional neural network (CNN) directly from support samples. Since the dependence of a small generated CNN model on a specific task is encoded by a high-capacity transformer model, we effectively decouple the complexity of the large task space from the complexity of individual tasks. Our method is particularly effective for small target CNN architectures where learning a fixed universal task-independent embedding is not optimal and better performance is attained when the information about the task can modulate all model parameters. For larger models we discover that generating the last layer alone allows us to produce competitive or better results than those obtained with state-of-the-art methods while being end-to-end differentiable. Finally, we extend our approach to a semi-supervised regime utilizing unlabeled samples in the support set and further improving few-shot performance.
    Real-Time Style Modelling of Human Locomotion via Feature-Wise Transformations and Local Motion Phases. (arXiv:2201.04439v1 [cs.GR])
    (2 min) Controlling the manner in which a character moves in a real-time animation system is a challenging task with useful applications. Existing style transfer systems require access to a reference content motion clip, however, in real-time systems the future motion content is unknown and liable to change with user input. In this work we present a style modelling system that uses an animation synthesis network to model motion content based on local motion phases. An additional style modulation network uses feature-wise transformations to modulate style in real-time. To evaluate our method, we create and release a new style modelling dataset, 100STYLE, containing over 4 million frames of stylised locomotion data in 100 different styles that present a number of challenges for existing systems. To model these styles, we extend the local phase calculation with a contact-free formulation. In comparison to other methods for real-time style modelling, we show our system is more robust and efficient in its style representation while improving motion quality.
    RGRecSys: A Toolkit for Robustness Evaluation of Recommender Systems. (arXiv:2201.04399v1 [cs.IR])
    (2 min) Robust machine learning is an increasingly important topic that focuses on developing models resilient to various forms of imperfect data. Due to the pervasiveness of recommender systems in online technologies, researchers have carried out several robustness studies focusing on data sparsity and profile injection attacks. Instead, we propose a more holistic view of robustness for recommender systems that encompasses multiple dimensions - robustness with respect to sub-populations, transformations, distributional disparity, attack, and data sparsity. While there are several libraries that allow users to compare different recommender system models, there is no software library for comprehensive robustness evaluation of recommender system models under different scenarios. As our main contribution, we present a robustness evaluation toolkit, Robustness Gym for RecSys (RGRecSys -- https://www.github.com/salesforce/RGRecSys), that allows us to quickly and uniformly evaluate the robustness of recommender system models.
    Dynamic Price of Parking Service based on Deep Learning. (arXiv:2201.04188v1 [cs.LG])
    (0 min) The improvement of air-quality in urban areas is one of the main concerns of public government bodies. This concern emerges from the evidence between the air quality and the public health. Major efforts from government bodies in this area include monitoring and forecasting systems, banning more pollutant motor vehicles, and traffic limitations during the periods of low-quality air. In this work, a proposal for dynamic prices in regulated parking services is presented. The dynamic prices in parking service must discourage motor vehicles parking when low-quality episodes are predicted. For this purpose, diverse deep learning strategies are evaluated. They have in common the use of collective air-quality measurements for forecasting labels about air quality in the city. The proposal is evaluated by using economic parameters and deep learning quality criteria at Madrid (Spain).
    The Turing Trap: The Promise & Peril of Human-Like Artificial Intelligence. (arXiv:2201.04200v1 [econ.GN])
    (0 min) In 1950, Alan Turing proposed an imitation game as the ultimate test of whether a machine was intelligent: could a machine imitate a human so well that its answers to questions indistinguishable from a human. Ever since, creating intelligence that matches human intelligence has implicitly or explicitly been the goal of thousands of researchers, engineers, and entrepreneurs. The benefits of human-like artificial intelligence (HLAI) include soaring productivity, increased leisure, and perhaps most profoundly, a better understanding of our own minds. But not all types of AI are human-like. In fact, many of the most powerful systems are very different from humans. So an excessive focus on developing and deploying HLAI can lead us into a trap. As machines become better substitutes for human labor, workers lose economic and political bargaining power and become increasingly dependent on those who control the technology. In contrast, when AI is focused on augmenting humans rather than mimicking them, then humans retain the power to insist on a share of the value created. Furthermore, augmentation creates new capabilities and new products and services, ultimately generating far more value than merely human-like AI. While both types of AI can be enormously beneficial, there are currently excess incentives for automation rather than augmentation among technologists, business executives, and policymakers.
    Adaptive Worker Grouping For Communication-Efficient and Straggler-Tolerant Distributed SGD. (arXiv:2201.04301v1 [cs.IT])
    (0 min) Wall-clock convergence time and communication load are key performance metrics for the distributed implementation of stochastic gradient descent (SGD) in parameter server settings. Communication-adaptive distributed Adam (CADA) has been recently proposed as a way to reduce communication load via the adaptive selection of workers. CADA is subject to performance degradation in terms of wall-clock convergence time in the presence of stragglers. This paper proposes a novel scheme named grouping-based CADA (G-CADA) that retains the advantages of CADA in reducing the communication load, while increasing the robustness to stragglers at the cost of additional storage at the workers. G-CADA partitions the workers into groups of workers that are assigned the same data shards. Groups are scheduled adaptively at each iteration, and the server only waits for the fastest worker in each selected group. We provide analysis and experimental results to elaborate the significant gains on the wall-clock time, as well as communication load and computation load, of G-CADA over other benchmark schemes.
    Evolutionary Action Selection for Gradient-based Policy Learning. (arXiv:2201.04286v1 [cs.NE])
    (2 min) Evolutionary Algorithms (EAs) and Deep Reinforcement Learning (DRL) have recently been combined to integrate the advantages of the two solutions for better policy learning. However, in existing hybrid methods, EA is used to directly train the policy network, which will lead to sample inefficiency and unpredictable impact on the policy performance. To better integrate these two approaches and avoid the drawbacks caused by the introduction of EA, we devote ourselves to devising a more efficient and reasonable method of combining EA and DRL. In this paper, we propose Evolutionary Action Selection-Twin Delayed Deep Deterministic Policy Gradient (EAS-TD3), a novel combination of EA and DRL. In EAS, we focus on optimizing the action chosen by the policy network and attempt to obtain high-quality actions to guide policy learning through an evolutionary algorithm. We conduct several experiments on challenging continuous control tasks. The result shows that EAS-TD3 shows superior performance over other state-of-art methods.
    Two Wrongs Can Make a Right: A Transfer Learning Approach for Chemical Discovery with Chemical Accuracy. (arXiv:2201.04243v1 [physics.chem-ph])
    (2 min) Appropriately identifying and treating molecules and materials with significant multi-reference (MR) character is crucial for achieving high data fidelity in virtual high throughput screening (VHTS). Nevertheless, most VHTS is carried out with approximate density functional theory (DFT) using a single functional. Despite development of numerous MR diagnostics, the extent to which a single value of such a diagnostic indicates MR effect on chemical property prediction is not well established. We evaluate MR diagnostics of over 10,000 transition metal complexes (TMCs) and compare to those in organic molecules. We reveal that only some MR diagnostics are transferable across these materials spaces. By studying the influence of MR character on chemical properties (i.e., MR effect) that involves multiple potential energy surfaces (i.e., adiabatic spin splitting, $\Delta E_\mathrm{H-L}$, and ionization potential, IP), we observe that cancellation in MR effect outweighs accumulation. Differences in MR character are more important than the total degree of MR character in predicting MR effect in property prediction. Motivated by this observation, we build transfer learning models to directly predict CCSD(T)-level adiabatic $\Delta E_\mathrm{H-L}$ and IP from lower levels of theory. By combining these models with uncertainty quantification and multi-level modeling, we introduce a multi-pronged strategy that accelerates data acquisition by at least a factor of three while achieving chemical accuracy (i.e., 1 kcal/mol) for robust VHTS.
    Intra-domain and cross-domain transfer learning for time series data -- How transferable are the features?. (arXiv:2201.04449v1 [cs.LG])
    (2 min) In practice, it is very demanding and sometimes impossible to collect datasets of tagged data large enough to successfully train a machine learning model, and one possible solution to this problem is transfer learning. This study aims to assess how transferable are the features between different domains of time series data and under which conditions. The effects of transfer learning are observed in terms of predictive performance of the models and their convergence rate during training. In our experiment, we use reduced data sets of 1,500 and 9,000 data instances to mimic real world conditions. Using the same scaled-down datasets, we trained two sets of machine learning models: those that were trained with transfer learning and those that were trained from scratch. Four machine learning models were used for the experiment. Transfer of knowledge was performed within the same domain of application (seismology), as well as between mutually different domains of application (seismology, speech, medicine, finance). We observe the predictive performance of the models and the convergence rate during the training. In order to confirm the validity of the obtained results, we repeated the experiments seven times and applied statistical tests to confirm the significance of the results. The general conclusion of our study is that transfer learning is very likely to either increase or not negatively affect the predictive performance of the model or its convergence rate. The collected data is analysed in more details to determine which source and target domains are compatible for transfer of knowledge. We also analyse the effect of target dataset size and the selection of model and its hyperparameters on the effects of transfer learning.
    Neural Capacitance: A New Perspective of Neural Network Selection via Edge Dynamics. (arXiv:2201.04194v1 [cs.LG])
    (2 min) Efficient model selection for identifying a suitable pre-trained neural network to a downstream task is a fundamental yet challenging task in deep learning. Current practice requires expensive computational costs in model training for performance prediction. In this paper, we propose a novel framework for neural network selection by analyzing the governing dynamics over synaptic connections (edges) during training. Our framework is built on the fact that back-propagation during neural network training is equivalent to the dynamical evolution of synaptic connections. Therefore, a converged neural network is associated with an equilibrium state of a networked system composed of those edges. To this end, we construct a network mapping $\phi$, converting a neural network $G_A$ to a directed line graph $G_B$ that is defined on those edges in $G_A$. Next, we derive a neural capacitance metric $\beta_{\rm eff}$ as a predictive measure universally capturing the generalization capability of $G_A$ on the downstream task using only a handful of early training results. We carried out extensive experiments using 17 popular pre-trained ImageNet models and five benchmark datasets, including CIFAR10, CIFAR100, SVHN, Fashion MNIST and Birds, to evaluate the fine-tuning performance of our framework. Our neural capacitance metric is shown to be a powerful indicator for model selection based only on early training results and is more efficient than state-of-the-art methods.
    On Training Implicit Models. (arXiv:2111.05177v4 [cs.LG] UPDATED)
    (2 min) This paper focuses on training implicit models of infinite layers. Specifically, previous works employ implicit differentiation and solve the exact gradient for the backward propagation. However, is it necessary to compute such an exact but expensive gradient for training? In this work, we propose a novel gradient estimate for implicit models, named phantom gradient, that 1) forgoes the costly computation of the exact gradient; and 2) provides an update direction empirically preferable to the implicit model training. We theoretically analyze the condition under which an ascent direction of the loss landscape could be found, and provide two specific instantiations of the phantom gradient based on the damped unrolling and Neumann series. Experiments on large-scale tasks demonstrate that these lightweight phantom gradients significantly accelerate the backward passes in training implicit models by roughly 1.7 times, and even boost the performance over approaches based on the exact gradient on ImageNet.
    Interacting with Explanations through Critiquing. (arXiv:2005.11067v4 [cs.CL] UPDATED)
    (2 min) Using personalized explanations to support recommendations has been shown to increase trust and perceived quality. However, to actually obtain better recommendations, there needs to be a means for users to modify the recommendation criteria by interacting with the explanation. We present a novel technique using aspect markers that learns to generate personalized explanations of recommendations from review texts, and we show that human users significantly prefer these explanations over those produced by state-of-the-art techniques. Our work's most important innovation is that it allows users to react to a recommendation by critiquing the textual explanation: removing (symmetrically adding) certain aspects they dislike or that are no longer relevant (symmetrically that are of interest). The system updates its user model and the resulting recommendations according to the critique. This is based on a novel unsupervised critiquing method for single- and multi-step critiquing with textual explanations. Experiments on two real-world datasets show that our system is the first to achieve good performance in adapting to the preferences expressed in multi-step critiquing.
    Dynamical Audio-Visual Navigation: Catching Unheard Moving Sound Sources in Unmapped 3D Environments. (arXiv:2201.04279v1 [cs.CV])
    (2 min) Recent work on audio-visual navigation targets a single static sound in noise-free audio environments and struggles to generalize to unheard sounds. We introduce the novel dynamic audio-visual navigation benchmark in which an embodied AI agent must catch a moving sound source in an unmapped environment in the presence of distractors and noisy sounds. We propose an end-to-end reinforcement learning approach that relies on a multi-modal architecture that fuses the spatial audio-visual information from a binaural audio signal and spatial occupancy maps to encode the features needed to learn a robust navigation policy for our new complex task settings. We demonstrate that our approach outperforms the current state-of-the-art with better generalization to unheard sounds and better robustness to noisy scenarios on the two challenging 3D scanned real-world datasets Replica and Matterport3D, for the static and dynamic audio-visual navigation benchmarks. Our novel benchmark will be made available at this http URL
    Predicting Terrorist Attacks in the United States using Localized News Data. (arXiv:2201.04292v1 [cs.LG])
    (2 min) Dozens of terrorist attacks are perpetrated in the United States every year, often causing fatalities and other significant damage. Toward the end of better understanding and mitigating these attacks, we present a set of machine learning models that learn from localized news data in order to predict whether a terrorist attack will occur on a given calendar date and in a given state. The best model--a Random Forest that learns from a novel variable-length moving average representation of the feature space--achieves area under the receiver operating characteristic scores $> .667$ on four of the five states that were impacted most by terrorism between 2015 and 2018. Our key findings include that modeling terrorism as a set of independent events, rather than as a continuous process, is a fruitful approach--especially when the events are sparse and dissimilar. Additionally, our results highlight the need for localized models that account for differences between locations. From a machine learning perspective, we found that the Random Forest model outperformed several deep models on our multimodal, noisy, and imbalanced data set, thus demonstrating the efficacy of our novel feature representation method in such a context. We also show that its predictions are relatively robust to time gaps between attacks and observed characteristics of the attacks. Finally, we analyze factors that limit model performance, which include a noisy feature space and small amount of available data. These contributions provide an important foundation for the use of machine learning in efforts against terrorism in the United States and beyond.
    Fighting Money-Laundering with Statistics and Machine Learning: An Introduction and Review. (arXiv:2201.04207v1 [stat.ML])
    (2 min) Money laundering is a profound, global problem. Nonetheless, there is little statistical and machine learning research on the topic. In this paper, we focus on anti-money laundering in banks. To help organize existing research in the field, we propose a unifying terminology and provide a review of the literature. This is structured around two central tasks: (i) client risk profiling and (ii) suspicious behavior flagging. We find that client risk profiling is characterized by diagnostics, i.e., efforts to find and explain risk factors. Suspicious behavior flagging, on the other hand, is characterized by non-disclosed features and hand-crafted risk indices. Finally, we discuss directions for future research. One major challenge is the lack of public data sets. This may, potentially, be addressed by synthetic data generation. Other possible research directions include semi-supervised and deep learning, interpretability and fairness of the results.
    Learning Robust Policies for Generalized Debris Capture with an Automated Tether-Net System. (arXiv:2201.04180v1 [cs.RO])
    (2 min) Tether-net launched from a chaser spacecraft provides a promising method to capture and dispose of large space debris in orbit. This tether-net system is subject to several sources of uncertainty in sensing and actuation that affect the performance of its net launch and closing control. Earlier reliability-based optimization approaches to design control actions however remain challenging and computationally prohibitive to generalize over varying launch scenarios and target (debris) state relative to the chaser. To search for a general and reliable control policy, this paper presents a reinforcement learning framework that integrates a proximal policy optimization (PPO2) approach with net dynamics simulations. The latter allows evaluating the episodes of net-based target capture, and estimate the capture quality index that serves as the reward feedback to PPO2. Here, the learned policy is designed to model the timing of the net closing action based on the state of the moving net and the target, under any given launch scenario. A stochastic state transition model is considered in order to incorporate synthetic uncertainties in state estimation and launch actuation. Along with notable reward improvement during training, the trained policy demonstrates capture performance (over a wide range of launch/target scenarios) that is close to that obtained with reliability-based optimization run over an individual scenario.
    Robust Contrastive Learning against Noisy Views. (arXiv:2201.04309v1 [cs.CV])
    (2 min) Contrastive learning relies on an assumption that positive pairs contain related views, e.g., patches of an image or co-occurring multimodal signals of a video, that share certain underlying information about an instance. But what if this assumption is violated? The literature suggests that contrastive learning produces suboptimal representations in the presence of noisy views, e.g., false positive pairs with no apparent shared information. In this work, we propose a new contrastive loss function that is robust against noisy views. We provide rigorous theoretical justifications by showing connections to robust symmetric losses for noisy binary classification and by establishing a new contrastive bound for mutual information maximization based on the Wasserstein distance measure. The proposed loss is completely modality-agnostic and a simple drop-in replacement for the InfoNCE loss, which makes it easy to apply to existing contrastive frameworks. We show that our approach provides consistent improvements over the state-of-the-art on image, video, and graph contrastive learning benchmarks that exhibit a variety of real-world noise patterns.
    Improving Disentangled Text Representation Learning with Information-Theoretic Guidance. (arXiv:2006.00693v3 [cs.LG] UPDATED)
    (2 min) Learning disentangled representations of natural language is essential for many NLP tasks, e.g., conditional text generation, style transfer, personalized dialogue systems, etc. Similar problems have been studied extensively for other forms of data, such as images and videos. However, the discrete nature of natural language makes the disentangling of textual representations more challenging (e.g., the manipulation over the data space cannot be easily achieved). Inspired by information theory, we propose a novel method that effectively manifests disentangled representations of text, without any supervision on semantics. A new mutual information upper bound is derived and leveraged to measure dependence between style and content. By minimizing this upper bound, the proposed method induces style and content embeddings into two independent low-dimensional spaces. Experiments on both conditional text generation and text-style transfer demonstrate the high quality of our disentangled representation in terms of content and style preservation.
    Optimal Fixed-Budget Best Arm Identification using the Augmented Inverse Probability Estimator in Two-Armed Gaussian Bandits with Unknown Variances. (arXiv:2201.04469v1 [stat.ML])
    (2 min) We consider the fixed-budget best arm identification problem in two-armed Gaussian bandits with unknown variances. The tightest lower bound on the complexity and an algorithm whose performance guarantee matches the lower bound have long been open problems when the variances are unknown and when the algorithm is agnostic to the optimal proportion of the arm draws. In this paper, we propose a strategy comprising a sampling rule with randomized sampling (RS) following the estimated target allocation probabilities of arm draws and a recommendation rule using the augmented inverse probability weighting (AIPW) estimator, which is often used in the causal inference literature. We refer to our strategy as the RS-AIPW strategy. In the theoretical analysis, we first derive a large deviation principle for martingales, which can be used when the second moment converges in mean, and apply it to our proposed strategy. Then, we show that the proposed strategy is asymptotically optimal in the sense that the probability of misidentification achieves the lower bound by Kaufmann et al. (2016) when the sample size becomes infinitely large and the gap between the two arms goes to zero.
    Learning to Identify Top Elo Ratings: A Dueling Bandits Approach. (arXiv:2201.04480v1 [cs.LG])
    (2 min) The Elo rating system is widely adopted to evaluate the skills of (chess) game and sports players. Recently it has been also integrated into machine learning algorithms in evaluating the performance of computerised AI agents. However, an accurate estimation of the Elo rating (for the top players) often requires many rounds of competitions, which can be expensive to carry out. In this paper, to improve the sample efficiency of the Elo evaluation (for top players), we propose an efficient online match scheduling algorithm. Specifically, we identify and match the top players through a dueling bandits framework and tailor the bandit algorithm to the gradient-based update of Elo. We show that it reduces the per-step memory and time complexity to constant, compared to the traditional likelihood maximization approaches requiring $O(t)$ time. Our algorithm has a regret guarantee of $\tilde{O}(\sqrt{T})$, sublinear in the number of competition rounds and has been extended to the multidimensional Elo ratings for handling intransitive games. We empirically demonstrate that our method achieves superior convergence speed and time efficiency on a variety of gaming tasks.
    Leveraging Unlabeled Data to Predict Out-of-Distribution Performance. (arXiv:2201.04234v1 [cs.LG])
    (2 min) Real-world machine learning deployments are characterized by mismatches between the source (training) and target (test) distributions that may cause performance drops. In this work, we investigate methods for predicting the target domain accuracy using only labeled source data and unlabeled target data. We propose Average Thresholded Confidence (ATC), a practical method that learns a threshold on the model's confidence, predicting accuracy as the fraction of unlabeled examples for which model confidence exceeds that threshold. ATC outperforms previous methods across several model architectures, types of distribution shifts (e.g., due to synthetic corruptions, dataset reproduction, or novel subpopulations), and datasets (Wilds, ImageNet, Breeds, CIFAR, and MNIST). In our experiments, ATC estimates target performance $2$-$4\times$ more accurately than prior methods. We also explore the theoretical foundations of the problem, proving that, in general, identifying the accuracy is just as hard as identifying the optimal predictor and thus, the efficacy of any method rests upon (perhaps unstated) assumptions on the nature of the shift. Finally, analyzing our method on some toy distributions, we provide insights concerning when it works.
    Embedding Principle of Loss Landscape of Deep Neural Networks. (arXiv:2105.14573v3 [cs.LG] UPDATED)
    (2 min) Understanding the structure of loss landscape of deep neural networks (DNNs)is obviously important. In this work, we prove an embedding principle that the loss landscape of a DNN "contains" all the critical points of all the narrower DNNs. More precisely, we propose a critical embedding such that any critical point, e.g., local or global minima, of a narrower DNN can be embedded to a critical point/hyperplane of the target DNN with higher degeneracy and preserving the DNN output function. The embedding structure of critical points is independent of loss function and training data, showing a stark difference from other nonconvex problems such as protein-folding. Empirically, we find that a wide DNN is often attracted by highly-degenerate critical points that are embedded from narrow DNNs. The embedding principle provides an explanation for the general easy optimization of wide DNNs and unravels a potential implicit low-complexity regularization during the training. Overall, our work provides a skeleton for the study of loss landscape of DNNs and its implication, by which a more exact and comprehensive understanding can be anticipated in the near
    Optimizing Prediction of MGMT Promoter Methylation from MRI Scans using Adversarial Learning. (arXiv:2201.04416v1 [eess.IV])
    (2 min) Glioblastoma Multiforme (GBM) is a malignant brain cancer forming around 48% of al brain and Central Nervous System (CNS) cancers. It is estimated that annually over 13,000 deaths occur in the US due to GBM, making it crucial to have early diagnosis systems that can lead to predictable and effective treatment. The most common treatment after GBM diagnosis is chemotherapy, which works by sending rapidly dividing cells to apoptosis. However, this form of treatment is not effective when the MGMT promoter sequence is methylated, and instead leads to severe side effects decreasing patient survivability. Therefore, it is important to be able to identify the MGMT promoter methylation status through non-invasive magnetic resonance imaging (MRI) based machine learning (ML) models. This is accomplished using the Brain Tumor Segmentation (BraTS) 2021 dataset, which was recently used for an international Kaggle competition. We developed four primary models - two radiomic models and two CNN models - each solving the binary classification task with progressive improvements. We built a novel ML model termed as the Intermediate State Generator which was used to normalize the slice thicknesses of all MRI scans. With further improvements, our best model was able to achieve performance significantly ($p < 0.05$) better than the best performing Kaggle model with a 6% increase in average cross-validation accuracy. This improvement could potentially lead to a more informed choice of chemotherapy as a treatment option, prolonging lives of thousands of patients with GBM each year.
    Fine-grained Graph Learning for Multi-view Subspace Clustering. (arXiv:2201.04604v1 [cs.LG])
    (2 min) Multi-view subspace clustering has conventionally focused on integrating heterogeneous feature descriptions to capture higher-dimensional information. One popular strategy is to generate a common subspace from different views and then apply graph-based approaches to deal with clustering. However, the performance of these methods is still subject to two limitations, namely the multiple views fusion pattern and the connection between the fusion process and clustering tasks. To address these problems, we propose a novel multi-view subspace clustering framework via fine-grained graph learning, which can tell the consistency of local structures between different views and integrate all views more delicately than previous weight regularizations. Different from other models in the literature, the point-level graph regularization and the reformulation of spectral clustering are introduced to perform graphs fusion and learn the shared cluster structure together. Extensive experiments on five real-world datasets show that the proposed framework has comparable performance to the SOTA algorithms.
    Dyna-T: Dyna-Q and Upper Confidence Bounds Applied to Trees. (arXiv:2201.04502v1 [cs.LG])
    (2 min) In this work we present a preliminary investigation of a novel algorithm called Dyna-T. In reinforcement learning (RL) a planning agent has its own representation of the environment as a model. To discover an optimal policy to interact with the environment, the agent collects experience in a trial and error fashion. Experience can be used for learning a better model or improve directly the value function and policy. Typically separated, Dyna-Q is an hybrid approach which, at each iteration, exploits the real experience to update the model as well as the value function, while planning its action using simulated data from its model. However, the planning process is computationally expensive and strongly depends on the dimensionality of the state-action space. We propose to build a Upper Confidence Tree (UCT) on the simulated experience and search for the best action to be selected during the on-line learning process. We prove the effectiveness of our proposed method on a set of preliminary tests on three testbed environments from Open AI. In contrast to Dyna-Q, Dyna-T outperforms state-of-the-art RL agents in the stochastic environments by choosing a more robust action selection strategy.
    Multi-task Joint Strategies of Self-supervised Representation Learning on Biomedical Networks for Drug Discovery. (arXiv:2201.04437v1 [cs.LG])
    (2 min) Self-supervised representation learning (SSL) on biomedical networks provides new opportunities for drug discovery which is lack of available biological or clinic phenotype. However, how to effectively combine multiple SSL models is challenging and rarely explored. Therefore, we propose multi-task joint strategies of self-supervised representation learning on biomedical networks for drug discovery, named MSSL2drug. We design six basic SSL tasks that are inspired by various modality features including structures, semantics, and attributes in biomedical heterogeneous networks. In addition, fifteen combinations of multiple tasks are evaluated by a graph attention-based adversarial multi-task learning framework in two drug discovery scenarios. The results suggest two important findings. (1) The combinations of multimodal tasks achieve the best performance compared to other multi-task joint strategies. (2) The joint training of local and global SSL tasks yields higher performance than random task combinations. Therefore, we conjecture that the multimodal and local-global combination strategies can be regarded as a guideline for multi-task SSL to drug discovery.
    Multiview Transformers for Video Recognition. (arXiv:2201.04288v1 [cs.CV])
    (2 min) Video understanding requires reasoning at multiple spatiotemporal resolutions -- from short fine-grained motions to events taking place over longer durations. Although transformer architectures have recently advanced the state-of-the-art, they have not explicitly modelled different spatiotemporal resolutions. To this end, we present Multiview Transformers for Video Recognition (MTV). Our model consists of separate encoders to represent different views of the input video with lateral connections to fuse information across views. We present thorough ablation studies of our model and show that MTV consistently performs better than single-view counterparts in terms of accuracy and computational cost across a range of model sizes. Furthermore, we achieve state-of-the-art results on five standard datasets, and improve even further with large-scale pretraining. We will release code and pretrained checkpoints.
    Preventing Manifold Intrusion with Locality: Local Mixup. (arXiv:2201.04368v1 [cs.LG])
    (2 min) Mixup is a data-dependent regularization technique that consists in linearly interpolating input samples and associated outputs. It has been shown to improve accuracy when used to train on standard machine learning datasets. However, authors have pointed out that Mixup can produce out-of-distribution virtual samples and even contradictions in the augmented training set, potentially resulting in adversarial effects. In this paper, we introduce Local Mixup in which distant input samples are weighted down when computing the loss. In constrained settings we demonstrate that Local Mixup can create a trade-off between bias and variance, with the extreme cases reducing to vanilla training and classical Mixup. Using standardized computer vision benchmarks , we also show that Local Mixup can improve test accuracy.
    Towards Adversarially Robust Deep Image Denoising. (arXiv:2201.04397v1 [eess.IV])
    (2 min) This work systematically investigates the adversarial robustness of deep image denoisers (DIDs), i.e, how well DIDs can recover the ground truth from noisy observations degraded by adversarial perturbations. Firstly, to evaluate DIDs' robustness, we propose a novel adversarial attack, namely Observation-based Zero-mean Attack ({\sc ObsAtk}), to craft adversarial zero-mean perturbations on given noisy images. We find that existing DIDs are vulnerable to the adversarial noise generated by {\sc ObsAtk}. Secondly, to robustify DIDs, we propose an adversarial training strategy, hybrid adversarial training ({\sc HAT}), that jointly trains DIDs with adversarial and non-adversarial noisy data to ensure that the reconstruction quality is high and the denoisers around non-adversarial data are locally smooth. The resultant DIDs can effectively remove various types of synthetic and adversarial noise. We also uncover that the robustness of DIDs benefits their generalization capability on unseen real-world noise. Indeed, {\sc HAT}-trained DIDs can recover high-quality clean images from real-world noise even without training on real noisy data. Extensive experiments on benchmark datasets, including Set68, PolyU, and SIDD, corroborate the effectiveness of {\sc ObsAtk} and {\sc HAT}.
    On the Statistical Complexity of Sample Amplification. (arXiv:2201.04315v1 [math.ST])
    (2 min) Given $n$ i.i.d. samples drawn from an unknown distribution $P$, when is it possible to produce a larger set of $n+m$ samples which cannot be distinguished from $n+m$ i.i.d. samples drawn from $P$? (Axelrod et al. 2019) formalized this question as the sample amplification problem, and gave optimal amplification procedures for discrete distributions and Gaussian location models. However, these procedures and associated lower bounds are tailored to the specific distribution classes, and a general statistical understanding of sample amplification is still largely missing. In this work, we place the sample amplification problem on a firm statistical foundation by deriving generally applicable amplification procedures, lower bound techniques and connections to existing statistical notions. Our techniques apply to a large class of distributions including the exponential family, and establish a rigorous connection between sample amplification and distribution learning.
    Brain Signals Analysis Based Deep Learning Methods: Recent advances in the study of non-invasive brain signals. (arXiv:2201.04229v1 [q-bio.NC])
    (2 min) Brain signals constitute the information that are processed by millions of brain neurons (nerve cells and brain cells). These brain signals can be recorded and analyzed using various of non-invasive techniques such as the Electroencephalograph (EEG), Magneto-encephalograph (MEG) as well as brain-imaging techniques such as Magnetic Resonance Imaging (MRI), Computed Tomography (CT) and others, which will be discussed briefly in this paper. This paper discusses about the currently emerging techniques such as the usage of different Deep Learning (DL) algorithms for the analysis of these brain signals and how these algorithms will be helpful in determining the neurological status of a person by applying the signal decoding strategy.
    ECONet: Efficient Convolutional Online Likelihood Network for Scribble-based Interactive Segmentation. (arXiv:2201.04584v1 [eess.IV])
    (2 min) Automatic segmentation of lung lesions associated with COVID-19 in CT images requires large amount of annotated volumes. Annotations mandate expert knowledge and are time-intensive to obtain through fully manual segmentation methods. Additionally, lung lesions have large inter-patient variations, with some pathologies having similar visual appearance as healthy lung tissues. This poses a challenge when applying existing semi-automatic interactive segmentation techniques for data labelling. To address these challenges, we propose an efficient convolutional neural networks (CNNs) that can be learned online while the annotator provides scribble-based interaction. To accelerate learning from only the samples labelled through user-interactions, a patch-based approach is used for training the network. Moreover, we use weighted cross-entropy loss to address the class imbalance that may result from user-interactions. During online inference, the learned network is applied to the whole input volume using a fully convolutional approach. We compare our proposed method with state-of-the-art and show that it outperforms existing methods on the task of annotating lung lesions associated with COVID-19, achieving 16% higher Dice score while reducing execution time by 3$\times$ and requiring 9000 lesser scribbles-based labelled voxels. Due to the online learning aspect, our approach adapts quickly to user input, resulting in high quality segmentation labels. Source code will be made available upon acceptance.
    SmartDet: Context-Aware Dynamic Control of Edge Task Offloading for Mobile Object Detection. (arXiv:2201.04235v1 [cs.DC])
    (2 min) Mobile devices increasingly rely on object detection (OD) through deep neural networks (DNNs) to perform critical tasks. Due to their high complexity, the execution of these DNNs requires excessive time and energy. Low-complexity object tracking (OT) can be used with OD, where the latter is periodically applied to generate "fresh" references for tracking. However, the frames processed with OD incur large delays, which may make the reference outdated and degrade tracking quality. Herein, we propose to use edge computing in this context, and establish parallel OT (at the mobile device) and OD (at the edge server) processes that are resilient to large OD latency. We propose Katch-Up, a novel tracking mechanism that improves the system resilience to excessive OD delay. However, while Katch-Up significantly improves performance, it also increases the computing load of the mobile device. Hence, we design SmartDet, a low-complexity controller based on deep reinforcement learning (DRL) that learns controlling the trade-off between resource utilization and OD performance. SmartDet takes as input context-related information related to the current video content and the current network conditions to optimize frequency and type of OD offloading, as well as Katch-Up utilization. We extensively evaluate SmartDet on a real-world testbed composed of a JetSon Nano as mobile device and a GTX 980 Ti as edge server, connected through a Wi-Fi link. Experimental results show that SmartDet achieves an optimal balance between tracking performance - mean Average Recall (mAR) and resource usage. With respect to a baseline with full Katch-Upusage and maximum channel usage, we still increase mAR by 4% while using 50% less of the channel and 30% power resources associated with Katch-Up. With respect to a fixed strategy using minimal resources, we increase mAR by 20% while using Katch-Up on 1/3 of the frames.
    Exact learning and test theory. (arXiv:2201.04506v1 [cs.CC])
    (2 min) In this paper, based on results of exact learning and test theory, we study arbitrary infinite binary information systems each of which consists of an infinite set of elements and an infinite set of two-valued functions (attributes) defined on the set of elements. We consider the notion of a problem over information system, which is described by a finite number of attributes: for a given element, we should recognize values of these attributes. As algorithms for problem solving, we consider decision trees of two types: (i) using only proper hypotheses (an analog of proper equivalence queries from exact learning), and (ii) using both attributes and proper hypotheses. As time complexity, we study the depth of decision trees. In the worst case, with the growth of the number of attributes in the problem description, the minimum depth of decision trees of both types either is bounded from above by a constant or grows as a logarithm, or linearly. Based on these results and results obtained earlier for attributes and arbitrary hypotheses, we divide the set of all infinite binary information systems into seven complexity classes.
    De-Noising of Photoacoustic Microscopy Images by Deep Learning. (arXiv:2201.04302v1 [eess.IV])
    (2 min) As a hybrid imaging technology, photoacoustic microscopy (PAM) imaging suffers from noise due to the maximum permissible exposure of laser intensity, attenuation of ultrasound in the tissue, and the inherent noise of the transducer. De-noising is a post-processing method to reduce noise, and PAM image quality can be recovered. However, previous de-noising techniques usually heavily rely on mathematical priors as well as manually selected parameters, resulting in unsatisfactory and slow de-noising performance for different noisy images, which greatly hinders practical and clinical applications. In this work, we propose a deep learning-based method to remove complex noise from PAM images without mathematical priors and manual selection of settings for different input images. An attention enhanced generative adversarial network is used to extract image features and remove various noises. The proposed method is demonstrated on both synthetic and real datasets, including phantom (leaf veins) and in vivo (mouse ear blood vessels and zebrafish pigment) experiments. The results show that compared with previous PAM de-noising methods, our method exhibits good performance in recovering images qualitatively and quantitatively. In addition, the de-noising speed of 0.016 s is achieved for an image with $256\times256$ pixels. Our approach is effective and practical for the de-noising of PAM images.
    Deep Symbolic Regression for Recurrent Sequences. (arXiv:2201.04600v1 [cs.LG])
    (2 min) Symbolic regression, i.e. predicting a function from the observation of its values, is well-known to be a challenging task. In this paper, we train Transformers to infer the function or recurrence relation underlying sequences of integers or floats, a typical task in human IQ tests which has hardly been tackled in the machine learning literature. We evaluate our integer model on a subset of OEIS sequences, and show that it outperforms built-in Mathematica functions for recurrence prediction. We also demonstrate that our float model is able to yield informative approximations of out-of-vocabulary functions and constants, e.g. $\operatorname{bessel0}(x)\approx \frac{\sin(x)+\cos(x)}{\sqrt{\pi x}}$ and $1.644934\approx \pi^2/6$. An interactive demonstration of our models is provided at https://bit.ly/3niE5FS.
    An Efficient and Adaptive Granular-ball Generation Method in Classification Problem. (arXiv:2201.04343v1 [cs.LG])
    (2 min) Granular-ball computing is an efficient, robust, and scalable learning method for granular computing. The basis of granular-ball computing is the granular-ball generation method. This paper proposes a method for accelerating the granular-ball generation using the division to replace $k$-means. It can greatly improve the efficiency of granular-ball generation while ensuring the accuracy similar to the existing method. Besides, a new adaptive method for the granular-ball generation is proposed by considering granular-ball's overlap eliminating and some other factors. This makes the granular-ball generation process of parameter-free and completely adaptive in the true sense. In addition, this paper first provides the mathematical models for the granular-ball covering. The experimental results on some real data sets demonstrate that the proposed two granular-ball generation methods have similar accuracies with the existing method while adaptiveness or acceleration is realized.
    Benchmarking Energy-Conserving Neural Networks for Learning Dynamics from Data. (arXiv:2012.02334v5 [cs.LG] UPDATED)
    (2 min) The last few years have witnessed an increased interest in incorporating physics-informed inductive bias in deep learning frameworks. In particular, a growing volume of literature has been exploring ways to enforce energy conservation while using neural networks for learning dynamics from observed time-series data. In this work, we survey ten recently proposed energy-conserving neural network models, including HNN, LNN, DeLaN, SymODEN, CHNN, CLNN and their variants. We provide a compact derivation of the theory behind these models and explain their similarities and differences. Their performance are compared in 4 physical systems. We point out the possibility of leveraging some of these energy-conserving models to design energy-based controllers.
    UniTTS: Residual Learning of Unified Embedding Space for Speech Style Control. (arXiv:2106.11171v2 [eess.AS] UPDATED)
    (2 min) We propose a novel high-fidelity expressive speech synthesis model, UniTTS, that learns and controls overlapping style attributes avoiding interference. UniTTS represents multiple style attributes in a single unified embedding space by the residuals between the phoneme embeddings before and after applying the attributes. The proposed method is especially effective in controlling multiple attributes that are difficult to separate cleanly, such as speaker ID and emotion, because it minimizes redundancy when adding variance in speaker ID and emotion, and additionally, predicts duration, pitch, and energy based on the speaker ID and emotion. In experiments, the visualization results exhibit that the proposed methods learned multiple attributes harmoniously in a manner that can be easily separated again. As well, UniTTS synthesized high-fidelity speech signals controlling multiple style attributes. The synthesized speech samples are presented at https://jackson-kang.github.io/paper_works/UniTTS/demos.
    Efficient and Interpretable Robot Manipulation with Graph Neural Networks. (arXiv:2102.13177v4 [cs.RO] UPDATED)
    (2 min) Manipulation tasks, like loading a dishwasher, can be seen as a sequence of spatial constraints and relationships between different objects. We aim to discover these rules from demonstrations by posing manipulation as a classification problem over a graph, whose nodes represent task-relevant entities like objects and goals, and present a graph neural network (GNN) policy architecture for solving this problem from demonstrations. In our experiments, a single GNN policy trained using imitation learning (IL) on 20 expert demos can solve blockstacking, rearrangement, and dishwasher loading tasks; once the policy has learned the spatial structure, it can generalize to a larger number of objects, goal configurations, and from simulation to the real world. These experiments show that graphical IL can solve complex long-horizon manipulation problems without requiring detailed task descriptions. Videos can be found at: https://youtu.be/POxaTDAj7aY.
    A semi-agnostic ansatz with variable structure for quantum machine learning. (arXiv:2103.06712v2 [quant-ph] UPDATED)
    (2 min) Quantum machine learning (QML) offers a powerful, flexible paradigm for programming near-term quantum computers, with applications in chemistry, metrology, materials science, data science, and mathematics. Here, one trains an ansatz, in the form of a parameterized quantum circuit, to accomplish a task of interest. However, challenges have recently emerged suggesting that deep ansatzes are difficult to train, due to flat training landscapes caused by randomness or by hardware noise. This motivates our work, where we present a variable structure approach to build ansatzes for QML. Our approach, called VAns (Variable Ansatz), applies a set of rules to both grow and (crucially) remove quantum gates in an informed manner during the optimization. Consequently, VAns is ideally suited to mitigate trainability and noise-related issues by keeping the ansatz shallow. We employ VAns in the variational quantum eigensolver for condensed matter and quantum chemistry applications and also in the quantum autoencoder for data compression, showing successful results in all cases.
    Trajectory-Constrained Deep Latent Visual Attention for Improved Local Planning in Presence of Heterogeneous Terrain. (arXiv:2112.04684v2 [cs.RO] UPDATED)
    (2 min) We present a reward-predictive, model-based deep learning method featuring trajectory-constrained visual attention for use in mapless, local visual navigation tasks. Our method learns to place visual attention at locations in latent image space which follow trajectories caused by vehicle control actions to enhance predictive accuracy during planning. The attention model is jointly optimized by the task-specific loss and an additional trajectory-constraint loss, allowing adaptability yet encouraging a regularized structure for improved generalization and reliability. Importantly, visual attention is applied in latent feature map space instead of raw image space to promote efficient planning. We validated our model in visual navigation tasks of planning low turbulence, collision-free trajectories in off-road settings and hill climbing with locking differentials in the presence of slippery terrain. Experiments involved randomized procedural generated simulation and real-world environments. We found our method improved generalization and learning efficiency when compared to no-attention and self-attention alternatives.
    Center Smoothing: Certified Robustness for Networks with Structured Outputs. (arXiv:2102.09701v3 [cs.LG] UPDATED)
    (2 min) The study of provable adversarial robustness has mostly been limited to classification tasks and models with one-dimensional real-valued outputs. We extend the scope of certifiable robustness to problems with more general and structured outputs like sets, images, language, etc. We model the output space as a metric space under a distance/similarity function, such as intersection-over-union, perceptual similarity, total variation distance, etc. Such models are used in many machine learning problems like image segmentation, object detection, generative models, image/audio-to-text systems, etc. Based on a robustness technique called randomized smoothing, our $\textit{center smoothing}$ procedure can produce models with the guarantee that the change in the output, as measured by the distance metric, remains small for any norm-bounded adversarial perturbation of the input. We apply our method to create certifiably robust models with disparate output spaces - from sets to images - and show that it yields meaningful certificates without significantly degrading the performance of the base model. Code for our experiments is available at: https://github.com/aounon/center-smoothing.
    Learning Domain Invariant Representations for Generalizable Person Re-Identification. (arXiv:2103.15890v3 [cs.CV] UPDATED)
    (2 min) Generalizable person Re-Identification (ReID) has attracted growing attention in recent computer vision community. In this work, we construct a structural causal model among identity labels, identity-specific factors (clothes/shoes color etc), and domain-specific factors (background, viewpoints etc). According to the causal analysis, we propose a novel Domain Invariant Representation Learning for generalizable person Re-Identification (DIR-ReID) framework. Specifically, we first propose to disentangle the identity-specific and domain-specific feature spaces, based on which we propose an effective algorithmic implementation for backdoor adjustment, essentially serving as a causal intervention towards the SCM. Extensive experiments have been conducted, showing that DIR-ReID outperforms state-of-the-art methods on large-scale domain generalization ReID benchmarks.
    Analysis of autocorrelation times in Neural Markov Chain Monte Carlo simulations. (arXiv:2111.10189v2 [cond-mat.stat-mech] UPDATED)
    (2 min) We provide a deepened study of autocorrelations in Neural Markov Chain Monte Carlo simulations, a version of the traditional Metropolis algorithm which employs neural networks to provide independent proposals. We illustrate our ideas using the two-dimensional Ising model. We propose several estimates of autocorrelation times, some inspired by analytical results derived for the Metropolized Independent Sampler, which we compare and study as a function of inverse temperature $\beta$. Based on that we propose an alternative loss function and study its impact on the autocorelation times. Furthermore, we investigate the impact of imposing system symmetries ($Z_2$ and/or translational) in the neural network training process on the autocorrelation times. Eventually, we propose a scheme which incorporates partial heat-bath updates. The impact of the above enhancements is discussed for a $16 \times 16$ spin system. The summary of our findings may serve as a guide to the implementation of Neural Markov Chain Monte Carlo simulations of more complicated models.
    On Calibration and Out-of-domain Generalization. (arXiv:2102.10395v4 [cs.LG] UPDATED)
    (2 min) Out-of-domain (OOD) generalization is a significant challenge for machine learning models. Many techniques have been proposed to overcome this challenge, often focused on learning models with certain invariance properties. In this work, we draw a link between OOD performance and model calibration, arguing that calibration across multiple domains can be viewed as a special case of an invariant representation leading to better OOD generalization. Specifically, we show that under certain conditions, models which achieve \emph{multi-domain calibration} are provably free of spurious correlations. This leads us to propose multi-domain calibration as a measurable and trainable surrogate for the OOD performance of a classifier. We therefore introduce methods that are easy to apply and allow practitioners to improve multi-domain calibration by training or modifying an existing model, leading to better performance on unseen domains. Using four datasets from the recently proposed WILDS OOD benchmark, as well as the Colored MNIST dataset, we demonstrate that training or tuning models so they are calibrated across multiple domains leads to significantly improved performance on unseen test domains. We believe this intriguing connection between calibration and OOD generalization is promising from both a practical and theoretical point of view.
    Holistic Deep Learning. (arXiv:2110.15829v2 [cs.LG] UPDATED)
    (2 min) There is much interest in deep learning to solve challenges that arise in applying neural network models in real-world environments. In particular, three areas have received considerable attention: adversarial robustness, parameter sparsity, and output stability. Despite numerous attempts on solving these problems independently, there is very little work addressing the challenges simultaneously. In this paper, we address this problem of constructing holistic deep learning models by proposing a novel formulation that solves these issues in combination. Real-world experiments on both tabular and MNIST dataset show that our formulation is able to simultaneously improve the accuracy, robustness, stability, and sparsity over traditional deep learning models among many others.
    ColO-RAN: Developing Machine Learning-based xApps for Open RAN Closed-loop Control on Programmable Experimental Platforms. (arXiv:2112.09559v2 [cs.NI] UPDATED)
    (2 min) In spite of the new opportunities brought about by the Open RAN, advances in ML-based network automation have been slow, mainly because of the unavailability of large-scale datasets and experimental testing infrastructure. This slows down the development and widespread adoption of Deep Reinforcement Learning (DRL) agents on real networks, delaying progress in intelligent and autonomous RAN control. In this paper, we address these challenges by proposing practical solutions and software pipelines for the design, training, testing, and experimental evaluation of DRL-based closed-loop control in the Open RAN. We introduce ColO-RAN, the first publicly-available large-scale O-RAN testing framework with software-defined radios-in-the-loop. Building on the scale and computational capabilities of the Colosseum wireless network emulator, ColO-RAN enables ML research at scale using O-RAN components, programmable base stations, and a "wireless data factory". Specifically, we design and develop three exemplary xApps for DRL-based control of RAN slicing, scheduling and online model training, and evaluate their performance on a cellular network with 7 softwarized base stations and 42 users. Finally, we showcase the portability of ColO-RAN to different platforms by deploying it on Arena, an indoor programmable testbed. Extensive results from our first-of-its-kind large-scale evaluation highlight the benefits and challenges of DRL-based adaptive control. They also provide insights on the development of wireless DRL pipelines, from data analysis to the design of DRL agents, and on the tradeoffs associated to training on a live RAN. ColO-RAN and the collected large-scale dataset will be made publicly available to the research community.
    Entity-Based Knowledge Conflicts in Question Answering. (arXiv:2109.05052v2 [cs.CL] UPDATED)
    (2 min) Knowledge-dependent tasks typically use two sources of knowledge: parametric, learned at training time, and contextual, given as a passage at inference time. To understand how models use these sources together, we formalize the problem of knowledge conflicts, where the contextual information contradicts the learned information. Analyzing the behaviour of popular models, we measure their over-reliance on memorized information (the cause of hallucinations), and uncover important factors that exacerbate this behaviour. Lastly, we propose a simple method to mitigate over-reliance on parametric knowledge, which minimizes hallucination, and improves out-of-distribution generalization by 4%-7%. Our findings demonstrate the importance for practitioners to evaluate model tendency to hallucinate rather than read, and show that our mitigation strategy encourages generalization to evolving information (i.e., time-dependent queries). To encourage these practices, we have released our framework for generating knowledge conflicts.
    Skin3D: Detection and Longitudinal Tracking of Pigmented Skin Lesions in 3D Total-Body Textured Meshes. (arXiv:2105.00374v2 [cs.CV] UPDATED)
    (2 min) We present an automated approach to detect and longitudinally track skin lesions on 3D total-body skin surface scans. The acquired 3D mesh of the subject is unwrapped to a 2D texture image, where a trained objected detection model, Faster R-CNN, localizes the lesions within the 2D domain. These detected skin lesions are mapped back to the 3D surface of the subject and, for subjects imaged multiple times, we construct a graph-based matching procedure to longitudinally track lesions that considers the anatomical correspondences among pairs of meshes and the geodesic proximity of corresponding lesions and the inter-lesion geodesic distances. We evaluated the proposed approach using 3DBodyTex, a publicly available dataset composed of 3D scans imaging the coloured skin (textured meshes) of 200 human subjects. We manually annotated locations that appeared to the human eye to contain a pigmented skin lesion as well as tracked a subset of lesions occurring on the same subject imaged in different poses. Our results, when compared to three human annotators, suggest that the trained Faster R-CNN detects lesions at a similar performance level as the human annotators. Our lesion tracking algorithm achieves an average matching accuracy of 88% on a set of detected corresponding pairs of prominent lesions of subjects imaged in different poses, and an average longitudinal accuracy of 71% when encompassing additional errors due to lesion detection. As there currently is no other large-scale publicly available dataset of 3D total-body skin lesions, we publicly release over 25,000 3DBodyTex manual annotations, which we hope will further research on total-body skin lesion analysis.
    Curriculum Offline Imitation Learning. (arXiv:2111.02056v2 [cs.LG] UPDATED)
    (2 min) Offline reinforcement learning (RL) tasks require the agent to learn from a pre-collected dataset with no further interactions with the environment. Despite the potential to surpass the behavioral policies, RL-based methods are generally impractical due to the training instability and bootstrapping the extrapolation errors, which always require careful hyperparameter tuning via online evaluation. In contrast, offline imitation learning (IL) has no such issues since it learns the policy directly without estimating the value function by bootstrapping. However, IL is usually limited in the capability of the behavioral policy and tends to learn a mediocre behavior from the dataset collected by the mixture of policies. In this paper, we aim to take advantage of IL but mitigate such a drawback. Observing that behavior cloning is able to imitate neighboring policies with less data, we propose \textit{Curriculum Offline Imitation Learning (COIL)}, which utilizes an experience picking strategy for imitating from adaptive neighboring policies with a higher return, and improves the current policy along curriculum stages. On continuous control benchmarks, we compare COIL against both imitation-based and RL-based methods, showing that it not only avoids just learning a mediocre behavior on mixed datasets but is also even competitive with state-of-the-art offline RL methods.
    Testing Self-Organized Criticality Across the Main Sequence using Stellar Flares from TESS. (arXiv:2109.07011v2 [astro-ph.SR] UPDATED)
    (2 min) Self-organized criticality describes a class of dynamical systems that maintain themselves in an attractor state with no intrinsic length or time scale. Fundamentally, this theoretical construct requires a mechanism for instability that may trigger additional instabilities locally via dissipative processes. This concept has been invoked to explain nonlinear dynamical phenomena such as featureless energy spectra that have been observed empirically for earthquakes, avalanches, and solar flares. If this interpretation proves correct, it implies that the solar coronal magnetic field maintains itself in a critical state via a delicate balance between the dynamo-driven injection of magnetic energy and the release of that energy via flaring events. All-sky high-cadence surveys like the Transiting Exoplanet Survey Satellite (TESS) provide the necessary data to compare the energy distribution of flaring events in stars of different spectral types to that observed in the Sun. We identified $\sim 10^6$ flaring events on $\sim 10^5$ stars observed by TESS at 2-minute cadence. By fitting the flare frequency distribution for different mass bins, we find that all main sequence stars exhibit distributions of flaring events similar to that observed in the Sun, independent of their mass or age. This may suggest that stars universally maintain a critical state in their coronal topologies via magnetic reconnection events. If this interpretation proves correct, we may be able to infer properties of magnetic fields, interior structure, and dynamo mechanisms for stars that are otherwise unresolved point sources.
    Game Theory for Adversarial Attacks and Defenses. (arXiv:2110.06166v3 [cs.LG] UPDATED)
    (2 min) Adversarial attacks can generate adversarial inputs by applying small but intentionally worst-case perturbations to samples from the dataset, which leads to even state-of-the-art deep neural networks outputting incorrect answers with high confidence. Hence, some adversarial defense techniques are developed to improve the security and robustness of the models and avoid them being attacked. Gradually, a game-like competition between attackers and defenders formed, in which both players would attempt to play their best strategies against each other while maximizing their own payoffs. To solve the game, each player would choose an optimal strategy against the opponent based on the prediction of the opponent's strategy choice. In this work, we are on the defensive side to apply game-theoretic approaches on defending against attacks. We use two randomization methods, random initialization and stochastic activation pruning, to create diversity of networks. Furthermore, we use one denoising technique, super resolution, to improve models' robustness by preprocessing images before attacks. Our experimental results indicate that those three methods can effectively improve the robustness of deep-learning neural networks.
    Who Is the Strongest Enemy? Towards Optimal and Efficient Evasion Attacks in Deep RL. (arXiv:2106.05087v3 [cs.LG] UPDATED)
    (2 min) Evaluating the worst-case performance of a reinforcement learning (RL) agent under the strongest/optimal adversarial perturbations on state observations (within some constraints) is crucial for understanding the robustness of RL agents. However, finding the optimal adversary is challenging, in terms of both whether we can find the optimal attack and how efficiently we can find it. Existing works on adversarial RL either use heuristics-based methods that may not find the strongest adversary, or directly train an RL-based adversary by treating the agent as a part of the environment, which can find the optimal adversary but may become intractable in a large state space. This paper introduces a novel attacking method to find the optimal attacks through collaboration between a designed function named "actor" and an RL-based learner named "director". The actor crafts state perturbations for a given policy perturbation direction, and the director learns to propose the best policy perturbation directions. Our proposed algorithm, PA-AD, is theoretically optimal and significantly more efficient than prior RL-based works in environments with large state spaces. Empirical results show that our proposed PA-AD universally outperforms state-of-the-art attacking methods in various Atari and MuJoCo environments. By applying PA-AD to adversarial training, we achieve state-of-the-art empirical robustness in multiple tasks under strong adversaries.
    GraphVAMPNet, using graph neural networks and variational approach to markov processes for dynamical modeling of biomolecules. (arXiv:2201.04609v1 [physics.comp-ph])
    (2 min) Finding low dimensional representation of data from long-timescale trajectories of biomolecular processes such as protein-folding or ligand-receptor binding is of fundamental importance and kinetic models such as Markov modeling have proven useful in describing the kinetics of these systems. Recently, an unsupervised machine learning technique called VAMPNet was introduced to learn the low dimensional representation and linear dynamical model in an end-to-end manner. VAMPNet is based on variational approach to Markov processes (VAMP) and relies on neural networks to learn the coarse-grained dynamics. In this contribution, we combine VAMPNet and graph neural networks to generate an end-to-end framework to efficiently learn high-level dynamics and metastable states from the long-timescale molecular dynamics trajectories. This method bears the advantages of graph representation learning and uses graph message passing operations to generate an embedding for each datapoint which is used in the VAMPNet to generate a coarse-grained representation. This type of molecular representation results in a higher resolution and more interpretable Markov model than the standard VAMPNet enabling a more detailed kinetic study of the biomolecular processes. Our GraphVAMPNet approach is also enhanced with an attention mechanism to find the important residues for classification into different metastable states.
    Agent-Temporal Attention for Reward Redistribution in Episodic Multi-Agent Reinforcement Learning. (arXiv:2201.04612v1 [cs.MA])
    (2 min) This paper considers multi-agent reinforcement learning (MARL) tasks where agents receive a shared global reward at the end of an episode. The delayed nature of this reward affects the ability of the agents to assess the quality of their actions at intermediate time-steps. This paper focuses on developing methods to learn a temporal redistribution of the episodic reward to obtain a dense reward signal. Solving such MARL problems requires addressing two challenges: identifying (1) relative importance of states along the length of an episode (along time), and (2) relative importance of individual agents' states at any single time-step (among agents). In this paper, we introduce Agent-Temporal Attention for Reward Redistribution in Episodic Multi-Agent Reinforcement Learning (AREL) to address these two challenges. AREL uses attention mechanisms to characterize the influence of actions on state transitions along trajectories (temporal attention), and how each agent is affected by other agents at each time-step (agent attention). The redistributed rewards predicted by AREL are dense, and can be integrated with any given MARL algorithm. We evaluate AREL on challenging tasks from the Particle World environment and the StarCraft Multi-Agent Challenge. AREL results in higher rewards in Particle World, and improved win rates in StarCraft compared to three state-of-the-art reward redistribution methods. Our code is available at https://github.com/baicenxiao/AREL.
  • stat.ML updates on arXiv.org

    Fast Online Changepoint Detection via Functional Pruning CUSUM statistics. (arXiv:2110.08205v2 [stat.ME] UPDATED)
    (2 min) Many modern applications of online changepoint detection require the ability to process high-frequency observations, sometimes with limited available computational resources. Online algorithms for detecting a change in mean often involve using a moving window, or specifying the expected size of change. Such choices affect which changes the algorithms have most power to detect. We introduce an algorithm, Functional Online CuSUM (FOCuS), which is equivalent to running these earlier methods simultaneously for all sizes of window, or all possible values for the size of change. Our theoretical results give tight bounds on the expected computational cost per iteration of FOCuS, with this being logarithmic in the number of observations. We show how FOCuS can be applied to a number of different change in mean scenarios, and demonstrate its practical utility through its state-of-the art performance at detecting anomalous behaviour in computer server data.
    Embedding Principle of Loss Landscape of Deep Neural Networks. (arXiv:2105.14573v3 [cs.LG] UPDATED)
    (2 min) Understanding the structure of loss landscape of deep neural networks (DNNs)is obviously important. In this work, we prove an embedding principle that the loss landscape of a DNN "contains" all the critical points of all the narrower DNNs. More precisely, we propose a critical embedding such that any critical point, e.g., local or global minima, of a narrower DNN can be embedded to a critical point/hyperplane of the target DNN with higher degeneracy and preserving the DNN output function. The embedding structure of critical points is independent of loss function and training data, showing a stark difference from other nonconvex problems such as protein-folding. Empirically, we find that a wide DNN is often attracted by highly-degenerate critical points that are embedded from narrow DNNs. The embedding principle provides an explanation for the general easy optimization of wide DNNs and unravels a potential implicit low-complexity regularization during the training. Overall, our work provides a skeleton for the study of loss landscape of DNNs and its implication, by which a more exact and comprehensive understanding can be anticipated in the near
    Manifold learning via quantum dynamics. (arXiv:2112.11161v2 [quant-ph] UPDATED)
    (2 min) We introduce an algorithm for computing geodesics on sampled manifolds that relies on simulation of quantum dynamics on a graph embedding of the sampled data. Our approach exploits classic results in semiclassical analysis and the quantum-classical correspondence, and forms a basis for techniques to learn the manifold from which a dataset is sampled, and subsequently for nonlinear dimensionality reduction of high-dimensional datasets. We illustrate the new algorithm with data sampled from model manifolds and also by a clustering demonstration based on COVID-19 mobility data. Finally, our method reveals interesting connections between the discretization provided by data sampling and quantization.
    When Random Tensors meet Random Matrices. (arXiv:2112.12348v2 [math.PR] UPDATED)
    (2 min) Relying on random matrix theory (RMT), this paper studies asymmetric order-$d$ spiked tensor models with Gaussian noise. Using the variational definition of the singular vectors and values of (Lim, 2005), we show that the analysis of the considered model boils down to the analysis of an equivalent spiked symmetric block-wise random matrix, that is constructed from contractions of the studied tensor with the singular vectors associated to its best rank-1 approximation. Our approach allows the exact characterization of the almost sure asymptotic singular value and alignments of the corresponding singular vectors with the true spike components, when $\frac{n_i}{\sum_{j=1}^d n_j}\to c_i\in [0, 1]$ with $n_i$'s the tensor dimensions. In contrast to other works that rely mostly on tools from statistical physics to study random tensors, our results rely solely on classical RMT tools such as Stein's lemma. Finally, classical RMT results concerning spiked random matrices are recovered as a particular case.
    Test for non-negligible adverse shifts. (arXiv:2107.02990v3 [stat.ML] UPDATED)
    (2 min) Statistical tests for dataset shift are susceptible to false alarms: they are sensitive to minor differences when there is in fact adequate sample coverage and predictive performance. We propose instead a framework to detect adverse dataset shifts based on outlier scores, $\texttt{D-SOS}$ for short. $\texttt{D-SOS}$ holds that the new (test) sample is not substantively worse than the reference (training) sample, and not that the two are equal. The key idea is to reduce observations to outlier scores and compare contamination rates at varying weighted thresholds. Users can define what $\it{worse}$ means in terms of relevant notions of outlyingness, including proxies for predictive performance. Compared to tests of equal distribution, our approach is uniquely tailored to serve as a robust metric for model monitoring and data validation. We show how versatile and practical $\texttt{D-SOS}$ is on a wide range of real and simulated data.
    An Embedded Model Estimator for Non-Stationary Random Functions using Multiple Secondary Variables. (arXiv:2011.04116v4 [stat.ME] UPDATED)
    (2 min) An algorithm for non-stationary spatial modelling using multiple secondary variables is developed. It combines Geostatistics with Quantile Random Forests to give a new interpolation and stochastic simulation algorithm. This paper introduces the method and shows that it has consistency results that are similar in nature to those applying to geostatistical modelling and to Quantile Random Forests. The method allows for embedding of simpler interpolation techniques, such as Kriging, to further condition the model. The algorithm works by estimating a conditional distribution for the target variable at each target location. The family of such distributions is called the envelope of the target variable. From this, it is possible to obtain spatial estimates, quantiles and uncertainty. An algorithm to produce conditional simulations from the envelope is also developed. As they sample from the envelope, realizations are therefore locally influenced by relative changes of importance of secondary variables, trends and variability.
    Interacting with Explanations through Critiquing. (arXiv:2005.11067v4 [cs.CL] UPDATED)
    (2 min) Using personalized explanations to support recommendations has been shown to increase trust and perceived quality. However, to actually obtain better recommendations, there needs to be a means for users to modify the recommendation criteria by interacting with the explanation. We present a novel technique using aspect markers that learns to generate personalized explanations of recommendations from review texts, and we show that human users significantly prefer these explanations over those produced by state-of-the-art techniques. Our work's most important innovation is that it allows users to react to a recommendation by critiquing the textual explanation: removing (symmetrically adding) certain aspects they dislike or that are no longer relevant (symmetrically that are of interest). The system updates its user model and the resulting recommendations according to the critique. This is based on a novel unsupervised critiquing method for single- and multi-step critiquing with textual explanations. Experiments on two real-world datasets show that our system is the first to achieve good performance in adapting to the preferences expressed in multi-step critiquing.
    A semi-agnostic ansatz with variable structure for quantum machine learning. (arXiv:2103.06712v2 [quant-ph] UPDATED)
    (2 min) Quantum machine learning (QML) offers a powerful, flexible paradigm for programming near-term quantum computers, with applications in chemistry, metrology, materials science, data science, and mathematics. Here, one trains an ansatz, in the form of a parameterized quantum circuit, to accomplish a task of interest. However, challenges have recently emerged suggesting that deep ansatzes are difficult to train, due to flat training landscapes caused by randomness or by hardware noise. This motivates our work, where we present a variable structure approach to build ansatzes for QML. Our approach, called VAns (Variable Ansatz), applies a set of rules to both grow and (crucially) remove quantum gates in an informed manner during the optimization. Consequently, VAns is ideally suited to mitigate trainability and noise-related issues by keeping the ansatz shallow. We employ VAns in the variational quantum eigensolver for condensed matter and quantum chemistry applications and also in the quantum autoencoder for data compression, showing successful results in all cases.
    Eigenvalue Distribution of Large Random Matrices Arising in Deep Neural Networks: Orthogonal Case. (arXiv:2201.04543v1 [stat.ML])
    (2 min) The paper deals with the distribution of singular values of the input-output Jacobian of deep untrained neural networks in the limit of their infinite width. The Jacobian is the product of random matrices where the independent rectangular weight matrices alternate with diagonal matrices whose entries depend on the corresponding column of the nearest neighbor weight matrix. The problem was considered in \cite{Pe-Co:18} for the Gaussian weights and biases and also for the weights that are Haar distributed orthogonal matrices and Gaussian biases. Basing on a free probability argument, it was claimed that in these cases the singular value distribution of the Jacobian in the limit of infinite width (matrix size) coincides with that of the analog of the Jacobian with special random but weight independent diagonal matrices, the case well known in random matrix theory. The claim was rigorously proved in \cite{Pa-Sl:21} for a quite general class of weights and biases with i.i.d. (including Gaussian) entries by using a version of the techniques of random matrix theory. In this paper we use another version of the techniques to justify the claim for random Haar distributed weight matrices and Gaussian biases.
    Analysis of autocorrelation times in Neural Markov Chain Monte Carlo simulations. (arXiv:2111.10189v2 [cond-mat.stat-mech] UPDATED)
    (2 min) We provide a deepened study of autocorrelations in Neural Markov Chain Monte Carlo simulations, a version of the traditional Metropolis algorithm which employs neural networks to provide independent proposals. We illustrate our ideas using the two-dimensional Ising model. We propose several estimates of autocorrelation times, some inspired by analytical results derived for the Metropolized Independent Sampler, which we compare and study as a function of inverse temperature $\beta$. Based on that we propose an alternative loss function and study its impact on the autocorelation times. Furthermore, we investigate the impact of imposing system symmetries ($Z_2$ and/or translational) in the neural network training process on the autocorrelation times. Eventually, we propose a scheme which incorporates partial heat-bath updates. The impact of the above enhancements is discussed for a $16 \times 16$ spin system. The summary of our findings may serve as a guide to the implementation of Neural Markov Chain Monte Carlo simulations of more complicated models.
    Deep Hedging: Learning to Remove the Drift under Trading Frictions with Minimal Equivalent Near-Martingale Measures. (arXiv:2111.07844v3 [q-fin.CP] UPDATED)
    (2 min) We present a machine learning approach for finding minimal equivalent martingale measures for markets simulators of tradable instruments, e.g. for a spot price and options written on the same underlying. We extend our results to markets with frictions, in which case we find "near-martingale measures" under which the prices of hedging instruments are martingales within their bid/ask spread. By removing the drift, we are then able to learn using Deep Hedging a "clean" hedge for an exotic payoff which is not polluted by the trading strategy trying to make money from statistical arbitrage opportunities. We correspondingly highlight the robustness of this hedge vs estimation error of the original market simulator. We discuss applications to two market simulators.
    Representation of Context-Specific Causal Models with Observational and Interventional Data. (arXiv:2101.09271v3 [math.ST] UPDATED)
    (2 min) We consider the problem of representing causal models that encode context-specific information for discrete data using a proper subclass of staged tree models which we call CStrees. We show that the context-specific information encoded by a CStree can be equivalently expressed via a collection of DAGs. As not all staged tree models admit this property, CStrees are a subclass that provides a transparent, intuitive and compact representation of context-specific causal information. We prove that CStrees admit a global Markov property which yields a graphical criterion for model equivalence generalizing that of Verma and Pearl for DAG models. These results extend to the general interventional model setting, making CStrees the first family of context-specific models admitting a characterization of interventional model equivalence. We also provide a closed-form formula for the maximum likelihood estimator of a CStree and use it to show that the Bayesian information criterion is a locally consistent score function for this model class. The performance of CStrees is analyzed on both simulated and real data, where we see that modeling with CStrees instead of general staged trees does not result in a significant loss of predictive accuracy, while affording DAG representations of context-specific causal information.
    On generalization bounds for deep networks based on loss surface implicit regularization. (arXiv:2201.04545v1 [stat.ML])
    (2 min) The classical statistical learning theory says that fitting too many parameters leads to overfitting and poor performance. That modern deep neural networks generalize well despite a large number of parameters contradicts this finding and constitutes a major unsolved problem towards explaining the success of deep learning. The implicit regularization induced by stochastic gradient descent (SGD) has been regarded to be important, but its specific principle is still unknown. In this work, we study how the local geometry of the energy landscape around local minima affects the statistical properties of SGD with Gaussian gradient noise. We argue that under reasonable assumptions, the local geometry forces SGD to stay close to a low dimensional subspace and that this induces implicit regularization and results in tighter bounds on the generalization error for deep neural networks. To derive generalization error bounds for neural networks, we first introduce a notion of stagnation sets around the local minima and impose a local essential convexity property of the population risk. Under these conditions, lower bounds for SGD to remain in these stagnation sets are derived. If stagnation occurs, we derive a bound on the generalization error of deep neural networks involving the spectral norms of the weight matrices but not the number of network parameters. Technically, our proofs are based on controlling the change of parameter values in the SGD iterates and local uniform convergence of the empirical loss functions based on the entropy of suitable neighborhoods around local minima. Our work attempts to better connect non-convex optimization and generalization analysis with uniform convergence.
    Fighting Money-Laundering with Statistics and Machine Learning: An Introduction and Review. (arXiv:2201.04207v1 [stat.ML])
    (2 min) Money laundering is a profound, global problem. Nonetheless, there is little statistical and machine learning research on the topic. In this paper, we focus on anti-money laundering in banks. To help organize existing research in the field, we propose a unifying terminology and provide a review of the literature. This is structured around two central tasks: (i) client risk profiling and (ii) suspicious behavior flagging. We find that client risk profiling is characterized by diagnostics, i.e., efforts to find and explain risk factors. Suspicious behavior flagging, on the other hand, is characterized by non-disclosed features and hand-crafted risk indices. Finally, we discuss directions for future research. One major challenge is the lack of public data sets. This may, potentially, be addressed by synthetic data generation. Other possible research directions include semi-supervised and deep learning, interpretability and fairness of the results.
    Improving Disentangled Text Representation Learning with Information-Theoretic Guidance. (arXiv:2006.00693v3 [cs.LG] UPDATED)
    (2 min) Learning disentangled representations of natural language is essential for many NLP tasks, e.g., conditional text generation, style transfer, personalized dialogue systems, etc. Similar problems have been studied extensively for other forms of data, such as images and videos. However, the discrete nature of natural language makes the disentangling of textual representations more challenging (e.g., the manipulation over the data space cannot be easily achieved). Inspired by information theory, we propose a novel method that effectively manifests disentangled representations of text, without any supervision on semantics. A new mutual information upper bound is derived and leveraged to measure dependence between style and content. By minimizing this upper bound, the proposed method induces style and content embeddings into two independent low-dimensional spaces. Experiments on both conditional text generation and text-style transfer demonstrate the high quality of our disentangled representation in terms of content and style preservation.
    Leveraging Unlabeled Data to Predict Out-of-Distribution Performance. (arXiv:2201.04234v1 [cs.LG])
    (2 min) Real-world machine learning deployments are characterized by mismatches between the source (training) and target (test) distributions that may cause performance drops. In this work, we investigate methods for predicting the target domain accuracy using only labeled source data and unlabeled target data. We propose Average Thresholded Confidence (ATC), a practical method that learns a threshold on the model's confidence, predicting accuracy as the fraction of unlabeled examples for which model confidence exceeds that threshold. ATC outperforms previous methods across several model architectures, types of distribution shifts (e.g., due to synthetic corruptions, dataset reproduction, or novel subpopulations), and datasets (Wilds, ImageNet, Breeds, CIFAR, and MNIST). In our experiments, ATC estimates target performance $2$-$4\times$ more accurately than prior methods. We also explore the theoretical foundations of the problem, proving that, in general, identifying the accuracy is just as hard as identifying the optimal predictor and thus, the efficacy of any method rests upon (perhaps unstated) assumptions on the nature of the shift. Finally, analyzing our method on some toy distributions, we provide insights concerning when it works.
    Optimal Fixed-Budget Best Arm Identification using the Augmented Inverse Probability Estimator in Two-Armed Gaussian Bandits with Unknown Variances. (arXiv:2201.04469v1 [stat.ML])
    (2 min) We consider the fixed-budget best arm identification problem in two-armed Gaussian bandits with unknown variances. The tightest lower bound on the complexity and an algorithm whose performance guarantee matches the lower bound have long been open problems when the variances are unknown and when the algorithm is agnostic to the optimal proportion of the arm draws. In this paper, we propose a strategy comprising a sampling rule with randomized sampling (RS) following the estimated target allocation probabilities of arm draws and a recommendation rule using the augmented inverse probability weighting (AIPW) estimator, which is often used in the causal inference literature. We refer to our strategy as the RS-AIPW strategy. In the theoretical analysis, we first derive a large deviation principle for martingales, which can be used when the second moment converges in mean, and apply it to our proposed strategy. Then, we show that the proposed strategy is asymptotically optimal in the sense that the probability of misidentification achieves the lower bound by Kaufmann et al. (2016) when the sample size becomes infinitely large and the gap between the two arms goes to zero.
  • Google AI Blog

    Learning to Route by Task for Efficient Inference
    (7 min) Posted by Sneha Kudugunta, Research Software Engineer and Orhan Firat, Research Scientist, Google Research Scaling large language models has resulted in significant quality improvements natural language understanding (T5), generation (GPT-3) and multilingual neural machine translation (M4). One common approach to building a larger model is to increase the depth (number of layers) and width (layer dimensionality), simply enlarging existing dimensions of the network. Such dense models take an input sequence (divided into smaller components, called tokens) and pass every token through the full network, activating every layer and parameter. While these large, dense models have achieved state-of-the-art results on multiple natural language processing (NLP) tasks, their training cost increase…
  • The Official NVIDIA Blog

    From Imagination to Animation, How an Omniverse Creator Makes Films Virtually
    (3 min) Growing up in the Philippines, award-winning filmmaker Jae Solina says he turned to movies for a reminder that the world was much larger than himself and his homeland. The post From Imagination to Animation, How an Omniverse Creator Makes Films Virtually appeared first on The Official NVIDIA Blog.
  • Blog

    Profiling Python Code
    (18 min) Profiling is a technique to figure out how time is spent in a program. With this statistics, we can find […] The post Profiling Python Code appeared first on Machine Learning Mastery.
  • MIT News - Artificial intelligence

    “Hey, Alexa! Are you trustworthy?”
    (7 min) The more social behaviors a voice-user interface exhibits, the more likely people are to trust it, engage with it, and consider it to be competent.
  • AWS Machine Learning Blog

    Label text for aspect-based sentiment analysis using SageMaker Ground Truth
    (13 min) The Amazon Machine Learning Solutions Lab (MLSL) recently created a tool for annotating text with named-entity recognition (NER) and relationship labels using Amazon SageMaker Ground Truth. Annotators use this tool to label text with named entities and link their relationships, thereby building a dataset for training state-of-the-art natural language processing (NLP) machine learning (ML) models. Most […]
  • Artificial Intelligence

    Looking for a text to speech AI
    (0 min) Hi, so I was looking for a text to speech, AI, much like natural readers or murf. Do you guys have any idea of one that is as good as them but is free? submitted by /u/ArturEPinheiro777 [link] [comments]
    Scientific Literature Review Generation v1.0
    (1 min) Hello , I've developed after my PhD a first version of an algorithm to automatically generate a literature review : https://www.naimai.fr and many remarks were given. I just deployed a new version with much more papers and I'll be thankful if you have any remarks about it :) More about the new version here : https://yaassinekaddi.medium.com/scientific-literature-generation-ii-73628aebd4fb Hopefully that could be useful for the PhDs (and the non PhDs) ! ​ Cheers, submitted by /u/Cyalas [link] [comments]
    Searching participants for art project
    (1 min) Hi, I’m part of an art group from Switzerland currently studying at HSLU Design & Arts (https://www.hslu.ch/de-ch/design-kunst/studium/bachelor/camera-arts/). The group consists of: Karim Beji (https://www.instagram.com/karimbeji_/ https://karimbeji.ch/) Emanuel Bohnenblust (https://www.instagram.com/e.bohnenblust/) Lea Karabash (https://www.instagram.com/leakarabashian/) Yen Shih-hsuan (https://www.instagram.com/shixuan.yan/ http://syen.hfk-bremen.de/) At the moment, we are working on a project on the topic if AI can augment the happiness of humans. To answer this question, we are mainly working with chatbots. The end result is going to be an exhibition at the end of March. For that exhibition, we want to conduct a trial in which people from over the world chat with a chatbot to find out if and how it augments the mood of the participants. We would give you access to a GPT-3 (OpenAI) chatbot and ask you to a) record yourself through a webcam (laptop) while you are chatting and b) simultaneously screen record the chat window. In the exhibition we would have a) a book with all the chats and b) small videos with your faces (webcam) to assess your mood. We would have a Zoom meeting beforehand to discuss everything. Looking forward to your message! submitted by /u/Nebeldiener [link] [comments]
    Are there any A.I. projects being developed that turn 2D (abstract) art into complex composition 3D models according to specs and dataset training of the A.I.? Is there any plugin like this for a 3D app?
    (1 min) A plugin that recognizes dots on a 2D picture to pull on them and make line predictions other things. Not mech engineering apps or apps that turn a set of 2D photos into 3D. submitted by /u/Niu_Davinci [link] [comments]
    Can you complete this short survey on the applications of conversational AI? I'd be much obliged 😅
    (0 min) submitted by /u/KazRainer [link] [comments]
    Growth Opportunities for Global Artificial Intelligence in Automotive
    (1 min) This research service examines the role artificial intelligence (AI) will play in the transformation of the automotive space. AI is a key disruptive technology, wherein automakers are evolving into technology firms and expanding their service offerings beyond manufacturing vehicles. Technology implementation has increased, and the post-pandemic situation appears to be positive for all stakeholders; however, automakers have yet to fully harness AI’s potential in their service offerings. Although AI is in the nascent stage of development, OEMs are adopting it across the automotive value chain to improve manufacturing and to enhance customer experience, marketing, sales, and after-sales services. This report examines use cases and business opportunity areas for various players in the automotive ecosystem, including OEMs, Tier I suppliers and technology service providers, and new entrants or start-ups. As the industry continues to evolve, AI capabilities will become the core of automotive solutions. The study identifies key AI trends impacting the industry, including the convergence of connectivity, autonomous, sharing/subscription, and electrification (CASE) ​ https://uk.sports.yahoo.com/news/growth-opportunities-global-artificial-intelligence-101800220.html?guccounter=1 submitted by /u/ai_datascience [link] [comments]
    Researchers From NVIDIA & Vanderbilt University Propose ‘Swin UNETR’: A Novel Architecture for Semantic Segmentation of Brain Tumors Using Multi-Modal MRI Images
    (1 min) The human brain is affected by about 120 different forms of brain tumors. AI-based intervention for tumor identification and surgical pre-assessment is on the verge of becoming a necessity rather than a luxury as we approach the era of artificial intelligence (AI) in healthcare. Brain tumors can be characterized in depth using techniques like volumetric analysis to track their growth and aid in pre-surgical planning. The characterization of defined tumors can be directly used for the prognosis of life expectancy, in addition to surgical uses. The segmentation of brain tumors is at the forefront of all such applications. Primary and secondary tumors are the two forms of brain tumors. Primary brain cancers arise from brain cells, whereas secondary tumors spread from other organs to the brain. Continue Reading Paper: https://arxiv.org/pdf/2201.01266v1.pdf Github: https://github.com/Project-MONAI/research-contributions/tree/master/SwinUNETR ​ https://preview.redd.it/tdmx86qgxkb81.png?width=996&format=png&auto=webp&s=59fe21d046e4fcba6b9ae4dd38890512757b8ffb submitted by /u/techsucker [link] [comments]
    Image to Text A.I??
    (1 min) Like the title says, I am looking to see if there is an A.I. that will generate a string of text based on what it thinks it sees in a given image (Eg./ [image of human DNA strand], and the A.I would type something like "image of the human genome, encoded as DNA, then list off the parts of the DNA strand [nuclei etc...].) I know text to image already exists but I cannot seem to find anything like this, even a more simple version that can process planets or animals or anything, thanks! submitted by /u/FreeOfGovernment [link] [comments]
  • Machine Learning

    [P] Sentiment Analysis on 7,000+ Words?
    (1 min) Alright, so quick disclaimer: I'm a high-schooler, and probably biting off more than I can chew, but I'm trying this either way, so please excuse any stupid questions. I'm looking to do sentiment analysis on a number of 7,000+ word long reports that have been released over a period of years to look at the change in sentiment. From what I've done and from what I've seen others do, most sentiment analysis is on text that is much, much shorter (~sentence length). Is there any model that can take text of that length? If not, my other thought was to split each report up into individual sentences, feed those through the model, and average the sentiment for each report. Could that work? Or would the results be utter crap? Are there better ways of tackling this issue? Any advice or information would be appreciated, thank you all so much! submitted by /u/Blue_Jay89 [link] [comments]
    [D] How is the MDP Homomorphism approach in the below paper different from RL with an Embedding Space like for example CURL?
    (1 min) I'm reading: Plannable Approximations to MDP Homomorphisms: Equivariance under Actions (arxiv: https://arxiv.org/pdf/2002.11963.pdf) and I'm not understanding how this approach differs from RL training in embedding space, for example, CURL. I do understand that they seem to be training both in the base space and the embedding space, but I'm not sure what this buys them. Thanks for your help! submitted by /u/www3cam [link] [comments]
    [Discussion] What type of a ML community member are you?
    (1 min) Hi all! I was thinking about how easy it is to follow Machine Learning as a non-professional in a such a depth as to have a reasonable/functional understanding of it, and got curious about what are the types of users in this community. If you are not a professional, don't plan to be, or you selected Other, feel free to elaborate about your relationship with ML in the comments. View Poll submitted by /u/TsirixtoVatraxi [link] [comments]
    [P] Back to back YOLO?
    (1 min) I'm working on a real-time image recognition project (shocker) and want to identify main classes (Car, sign, tree) as well as subclasses within them ((Audi, BMW, Ford), (Stop, Yield, Speed Limit), (Pine, Douglas, Maple)). It would be cool to do sub-sub classes, like ID Sign->Speed Limit->45MPH, but that might be extreme. I was thinking running YOLO to do the main class and then piping the results into separate YOLOs might work, but can't find much on how that might be done or if it's even a good idea. Any guidance would be a great help!! submitted by /u/CosmosFood [link] [comments]
    [D] What framework do you do your data processing (small and large datasets) in?
    (1 min) Hey all! I'm curious to know what everyone does their data processing in? For industry purposes on smaller datasets I've used Pandas and sklearn, while for larger ones I've used Dask. I know that Tensorflow and Pytorch have their own dataloader frameworks but does it scale to large datasets? Say 100GB++? Do people do data transformations in Jax? if you had infinite time how would you (re)do your company tech stack? My data is 85% timeseries, 10% images, and 5% tabular data. submitted by /u/iamquah [link] [comments]
    [P] Scientific Literature Review Generation v1.0
    (1 min) Hello , I've developed after my PhD a first version of an algorithm to automatically generate a literature review : https://www.naimai.fr and many remarks were given. I just deployed a new version with much more papers and I'll be thankful if you have any remarks about it :) More about the new version here : https://yaassinekaddi.medium.com/scientific-literature-generation-ii-73628aebd4fb Hopefully that could be useful for the PhDs (and the non PhDs) ! ​ Cheers, submitted by /u/Cyalas [link] [comments]
    [D] Reinforcement Learning As A Fine-Tuning Paradigm
    (0 min) https://ankeshanand.com/blog/2022/01/08/rl-fine-tuning.html ​ Reinforcement Learning (RL) should be better seen as a “fine-tuning” paradigm that can add capabilities to general-purpose pretrained models, rather than a paradigm that can bootstrap intelligence from scratch. submitted by /u/EducationalCicada [link] [comments]
    [D] Leading Research Groups/ Researchers in AI Robotics & Theoretical RL
    (1 min) Hi, I would like to know who are the current leading researchers/ research groups in AI Robotics and other in Theoretical Reinforcement Learning. Groups I know: Theoretical Reinforcement Learning - Rich Sutton's Lab at UofAlberta AI Robotics Groups: Stanford AI Lab Sergey Levine's Lab I want to make a comprehensive list of this and would really appreciate any kind of help here.Thanks. submitted by /u/EqualMarsupial2759 [link] [comments]
    [R] GradMax: Growing Neural Networks using Gradient Information
    (0 min) submitted by /u/hardmaru [link] [comments]
    [N] Machine Learning for Audio with Hugging Face
    (3 min) Hi! Over the last year at Hugging Face we’ve been working quite a bit to enable people to take advantage of state-of-the-art ML methods for audio, and we’re trying to get the word out about our efforts. Meh, I’m not a 🤗 Transformers user, why should I care? We’re not just building stuff for 🤗 Transformers users! We have built integrations with open-source libraries such as speechbrain, pyannote, ESPnet, Asteroid and Facebook Fairseq as well as dataset tools. Some example models you can try directly in the browser: FastSpeech 2 for Text to Speech pyannote Audio Segmentation SpeechBrain Language Identification model Facebook XLS-R demo to translate your audio from any language to any other target language Where’s the data? We’re working on improving audio support in the datase…
    [D] What is the most accurate open source ASR library with multiple language support?
    (1 min) I am working on one project involving free speech and I was wondering what your opinions are what are the best open source libraries that can do the transcription job well. (I use python and the languages are German and Dutch apart from English ). I heard about NeMo from Nvidia, however currently they do not support Dutch? I know that sometimes solutions are not stable and of course it relies on the quality of the recording a lot, but assuming the good quality of the record, etc. submitted by /u/MrJacobJohnson [link] [comments]
    [R] Bayesian additive regression trees and the General BART model
    (1 min) tl;dr: There are a variety of BART models for different kinds of data, this article reviews a bunch of them. abstract: "Bayesian additive regression trees (BART) is a flexible prediction model/machine learning approach that has gained widespread popularity in recent years. As BART becomes more mainstream, there is an increased need for a paper that walks readers through the details of BART, from what it is to why it works. This tutorial is aimed at providing such a resource. In addition to explaining the different components of BART using simple examples, we also discuss a framework, the General BART model, that unifies some of the recent BART extensions, including semiparametric models, correlated outcomes, statistical matching problems in surveys, and models with weaker distributional assumptions. By showing how these models fit into a single framework, we hope to demonstrate a simple way of applying BART to research problems that go beyond the original independent continuous or binary outcomes framework. " paper: https://onlinelibrary.wiley.com/doi/full/10.1002/sim.8347 arxiv'd: https://arxiv.org/abs/1901.07504 submitted by /u/bikeskata [link] [comments]
    [P] - Tutorial: Time-Series Transformer + Time2Vec
    (4 min) [Tutorial] Transformers for Time-Series (Time2Vec) How to use transformers for Time-series data? 🤔 TLDR: The notebook Rolling Window Setup Our data comes in a table. We want to model a series. (which is not a table) We got a problem. In order to solve our problem we will transform the table to the following set-up: Each of our input vectors will be a slice of the dataset representing a rolling-window of N times-steps, For each of these vectors we will also slice a target vector corresponding to the indices of the rolling-window. In short, we create the following input and output: X shape: (samples, time-steps, features) y shape: (samples, 1) Figure: https://i.ibb.co/ZfsK9ZY/1-q-6btr-S9-S-mc-JDdc-AW-It-A.png In order for us to get a better understanding, let…
    [N] AugeLab Studio - Machine Vision made easy
    (1 min) Hello all, ​ I am one of the founders and senior programmers in AugeLab Studio, a visual programming environment that focuses on rapid development and deployment of machine vision projects. Supports yolo, tensorflow, opencv, scikit and many libraries. Currently available in Windows Machines. Take a sneak peak: https://www.youtube.com/watch?v=GSy07Kdjf70 The concept is based on nodes and connecting them to create a flow that represents an algorithm. You can also write your own nodes and integrate them on the go. A simple measuring task can be done visually like below: ​ Measuring distance between contours You can download it for free from www.augelab.com. Professional version of the software which is available for students and researchers, includes object detection algorithms(pre-trained), training object detection and labeling tools, creating custom CNNs. ​ Feel free to ask anything. submitted by /u/Suspcious-chair [link] [comments]
    [D] Alternatives to SKLearn interface?
    (2 min) The fit/predict interface introduced by SKLearn have been really successful and have shaped the ML community in the past decade, but lately I am thinking about possible alternatives or extensions. The only consistent alternative I found is the Haiku init/apply, and in general the approach of some "pure" JAX projects. Do you know any alternative approch? Do you some ideas on what a good 'ML-grammar' should be? submitted by /u/abio93 [link] [comments]
    [R] Pushing the limits of self-supervised ResNets: Can we outperform supervised learning without labels on ImageNet?
    (0 min) submitted by /u/downtownslim [link] [comments]
    Deep Learning Interviews: Hundreds of fully solved job interview questions from a wide range of key topics in AI
    (1 min) submitted by /u/ksinkar [link] [comments]
    [R] HyperTransformer: Model Generation for Supervised and Semi-Supervised Few-Shot Learning
    (0 min) submitted by /u/hardmaru [link] [comments]
  • Reinforcement Learning

    Reinforcement Learning As A Fine-Tuning Paradigm
    (0 min) submitted by /u/EducationalCicada [link] [comments]
    Leading Research Groups/ Researchers in AI Robotics & Theoretical RL
    (1 min) Hi, I would like to know who are the current leading researchers/ research groups in AI Robotics and other in Theoretical Reinforcement Learning. Groups I know: Theoretical Reinforcement Learning - Rich Sutton's Lab at UofAlberta AI Robotics Groups: Stanford AI Lab Sergey Levine's Lab I want to make a comprehensive list of this and would really appreciate any kind of help here.Thanks. submitted by /u/EqualMarsupial2759 [link] [comments]
    Sub-optimal actions in a continuous setting
    (1 min) I am currently working on a simple battery management agent (PPO, SB3), which should optimize its charge/discharge under a variety of loads. The agent converges and its actions [-1:discharge entire battery, 1: charge entire battery] are pretty good, but the discharge never matches the load exactly (optimal), there is always a deviation of +-10% or so. Is this something inherent to the algorithm/RL or am I missing something elementary? What Hyperparameter have an influence on the action distribution and are worth experimenting with? I tried an env where there is a single obs[0,1] and where the agent needs to output that exact number as its action[0,1] for max rewards and it struggles to achieve this. (accurate to 3-4 decimal points, more accurate with lower learning rate) Input would be greatly appreciated! submitted by /u/throwaway582643766 [link] [comments]
    Comprehensive list of differences between value-based and policy-based DRL
    (1 min) I'm trying to create a comprehensive list of differences between value-based and policy-based DRL. 1 The main advantage of learning parameterized policies is that policies can now be any learnable function, whereas value-based methods, in the case of continuous action spaces, are severely limited. 2 Policy-based methods can more easily learn stochastic policies, which in turn has multiple additional advantages a) Better performance under partially observable environments because we can learn arbitrary probabilities of actions and thus the agent is less dependent on the Markov assumption. In value-based methods we have to force exploration with some probability to ensure optimality, whereas in policy-based methods with stochastic policies, exploration is embedded in the learned function, and converging to a deterministic policy for a given state while training is possible b) It could be more straightforward for function approximation to represent a policy than a value function. Sometimes value functions are too much information for what’s truly needed. 3 Because policies are parameterized with continuous values, the action probabilities change smoothly as a function of the learned parameters. Therefore, policy-based methods often have better convergence properties. 4 One of the main advantages of optimizing policies directly is that it’s the right objective. We learn a policy that optimizes the value function directly, without learning a value function, and without taking into account the dynamics of the environment. 5 The main advantage of value-based deep reinforcement learning is that they are (usually) off-policy methods and when they work, they tend to be much more sample efficient than policy optimization methods, because they can exploit data from experts or other sources. Do you guys agree with this? Can you think of other points I've surely missed? Thanks! submitted by /u/No_Possibility_7588 [link] [comments]
    Multi-Agent RL in Computer Vision.
    (1 min) Can you note me some example ‏‏‎ where multi-agent RL framework is used to solve computer vision tasks like object detection, event detection, action recognition, tracking and so on. I am actually interested to know how multiple agents communicate/cooperate to solve a task. How the communication/ cooperation can be modeled. submitted by /u/Infamous-Editor5131 [link] [comments]
    In my Deep RL algorithm, my state representation is a vector of 'n' dimensions , which i need to get from k vectors of same 'n' dimensions each . What neural network technique should i use to obtain this state representation .
    (1 min) submitted by /u/aabra__ka__daabra [link] [comments]
    What is the difference between a batch and a buffer?
    (1 min) What is the difference between how NFQ grows a batch to optimize several samples at the same time and and DQN has a replay buffer that holds experience samples for several steps? submitted by /u/No_Possibility_7588 [link] [comments]
    How to integrate a trained supervised learning model into RL?
    (2 min) Hi, this might be a longshot but here we go: I have developed a very good hft classification price predictor implemented in tensorflow using some of the modern NLP approaches. HOWEVER, market making is a bit more tricky then that, hence I want to use a RL framework to develop a strategy around it (particularly with focus on inventory management) Does anybody know where to start? I've got a classification model which output (one of three classes) would work as an input signal for a RL strategy as well as the price change over time of the underlying. Metrics could be either PnL etc. I've got billions of data points too. Would be grateful if someone can point me into a good direction on where to start (I.e. combining trained SL models with RL) - or if I'm barking up the wrong tree completely. I Did Supervised learning for a long time, but not RL. Ideally i'd be looking for a smaller model. submitted by /u/stocktiger6969 [link] [comments]
    RL Reward in MATLAB
    (1 min) Hi guys, I have a dc motor Simulink in MATLAB that I want to control using RL(DQN). I implemented my agent, but it always gives me a constant reward. In the training step, the step changed but my reward is constant and gives me a constant reward. Even adding two observations gave me a constant reward while it completed the training steps. Does anybody know the reason for that? ​ ​ https://preview.redd.it/95iqswg3ulb81.png?width=771&format=png&auto=webp&s=5d499a9a0e633fd572c56405687ed31fbfb991a5 submitted by /u/Armin1371 [link] [comments]
    "Automated Reinforcement Learning (AutoRL): A Survey and Open Problems", Parker-Holder et al 2022
    (1 min) submitted by /u/gwern [link] [comments]

2022-01-13

  • cs.LG updates on arXiv.org

    FRIDA -- Generative Feature Replay for Incremental Domain Adaptation. (arXiv:2112.14316v2 [cs.CV] UPDATED)
    (2 min) We tackle the novel problem of incremental unsupervised domain adaptation (IDA) in this paper. We assume that a labeled source domain and different unlabeled target domains are incrementally observed with the constraint that data corresponding to the current domain is only available at a time. The goal is to preserve the accuracies for all the past domains while generalizing well for the current domain. The IDA setup suffers due to the abrupt differences among the domains and the unavailability of past data including the source domain. Inspired by the notion of generative feature replay, we propose a novel framework called Feature Replay based Incremental Domain Adaptation (FRIDA) which leverages a new incremental generative adversarial network (GAN) called domain-generic auxiliary classification GAN (DGAC-GAN) for producing domain-specific feature representations seamlessly. For domain alignment, we propose a simple extension of the popular domain adversarial neural network (DANN) called DANN-IB which encourages discriminative domain-invariant and task-relevant feature learning. Experimental results on Office-Home, Office-CalTech, and DomainNet datasets confirm that FRIDA maintains superior stability-plasticity trade-off than the literature.
    Universal Efficient Variable-rate Neural Image Compression. (arXiv:2111.11305v3 [eess.IV] UPDATED)
    (2 min) Recently, Learning-based image compression has reached comparable performance with traditional image codecs(such as JPEG, BPG, WebP). However, computational complexity and rate flexibility are still two major challenges for its practical deployment. To tackle these problems, this paper proposes two universal modules named Energy-based Channel Gating(ECG) and Bit-rate Modulator(BM), which can be directly embedded into existing end-to-end image compression models. ECG uses dynamic pruning to reduce FLOPs for more than 50\% in convolution layers, and a BM pair can modulate the latent representation to control the bit-rate in a channel-wise manner. By implementing these two modules, existing learning-based image codecs can obtain ability to output arbitrary bit-rate with a single model and reduced computation.
    Distilling the Knowledge of Romanian BERTs Using Multiple Teachers. (arXiv:2112.12650v2 [cs.CL] UPDATED)
    (2 min) Running large-scale pre-trained language models in computationally constrained environments remains a challenging problem yet to be addressed, while transfer learning from these models has become prevalent in Natural Language Processing tasks. Several solutions, including knowledge distillation, network quantization, or network pruning have been previously proposed; however, these approaches focus mostly on the English language, thus widening the gap when considering low-resource languages. In this work, we introduce three light and fast versions of distilled BERT models for the Romanian language: Distil-BERT-base-ro, Distil-RoBERT-base, and DistilMulti-BERT-base-ro. The first two models resulted from the individual distillation of knowledge from two base versions of Romanian BERTs available in literature, while the last one was obtained by distilling their ensemble. To our knowledge, this is the first attempt to create publicly available Romanian distilled BERT models, which were thoroughly evaluated on five tasks: part-of-speech tagging, named entity recognition, sentiment analysis, semantic textual similarity, and dialect identification. Our experimental results argue that the three distilled models maintain most performance in terms of accuracy with their teachers, while being twice as fast on a GPU and ~35% smaller. In addition, we further test the similarity between the predictions of our students versus their teachers by measuring their label and probability loyalty, together with regression loyalty - a new metric introduced in this work.
    Scalable Diverse Model Selection for Accessible Transfer Learning. (arXiv:2111.06977v2 [cs.LG] UPDATED)
    (2 min) With the preponderance of pretrained deep learning models available off-the-shelf from model banks today, finding the best weights to fine-tune to your use-case can be a daunting task. Several methods have recently been proposed to find good models for transfer learning, but they either don't scale well to large model banks or don't perform well on the diversity of off-the-shelf models. Ideally the question we want to answer is, "given some data and a source model, can you quickly predict the model's accuracy after fine-tuning?" In this paper, we formalize this setting as "Scalable Diverse Model Selection" and propose several benchmarks for evaluating on this task. We find that existing model selection and transferability estimation methods perform poorly here and analyze why this is the case. We then introduce simple techniques to improve the performance and speed of these algorithms. Finally, we iterate on existing methods to create PARC, which outperforms all other methods on diverse model selection. We have released the benchmarks and method code in hope to inspire future work in model selection for accessible transfer learning.
    Extending Environments To Measure Self-Reflection In Reinforcement Learning. (arXiv:2110.06890v2 [cs.AI] UPDATED)
    (2 min) We consider an extended notion of reinforcement learning in which the environment can simulate the agent and base its outputs on the agent's hypothetical behavior. Since good performance usually requires paying attention to whatever things the environment's outputs are based on, we argue that for an agent to achieve on-average good performance across many such extended environments, it is necessary for the agent to self-reflect. Thus, an agent's self-reflection ability can be numerically estimated by running the agent through a battery of extended environments. We are simultaneously releasing an open-source library of extended environments to serve as proof-of-concept of this technique. As the library is first-of-kind, we have avoided the difficult problem of optimizing it. Instead we have chosen environments with interesting properties. Some seem paradoxical, some lead to interesting thought experiments, some are even suggestive of how self-reflection might have evolved in nature. We give examples and introduce a simple transformation which experimentally seems to increase self-reflection.
    Improving language models by retrieving from trillions of tokens. (arXiv:2112.04426v2 [cs.CL] UPDATED)
    (2 min) We enhance auto-regressive language models by conditioning on document chunks retrieved from a large corpus, based on local similarity with preceding tokens. With a $2$ trillion token database, our Retrieval-Enhanced Transformer (RETRO) obtains comparable performance to GPT-3 and Jurassic-1 on the Pile, despite using 25$\times$ fewer parameters. After fine-tuning, RETRO performance translates to downstream knowledge-intensive tasks such as question answering. RETRO combines a frozen Bert retriever, a differentiable encoder and a chunked cross-attention mechanism to predict tokens based on an order of magnitude more data than what is typically consumed during training. We typically train RETRO from scratch, yet can also rapidly RETROfit pre-trained transformers with retrieval and still achieve good performance. Our work opens up new avenues for improving language models through explicit memory at unprecedented scale.
    Directional Message Passing on Molecular Graphs via Synthetic Coordinates. (arXiv:2111.04718v2 [cs.LG] UPDATED)
    (2 min) Graph neural networks that leverage coordinates via directional message passing have recently set the state of the art on multiple molecular property prediction tasks. However, they rely on atom position information that is often unavailable, and obtaining it is usually prohibitively expensive or even impossible. In this paper we propose synthetic coordinates that enable the use of advanced GNNs without requiring the true molecular configuration. We propose two distances as synthetic coordinates: Distance bounds that specify the rough range of molecular configurations, and graph-based distances using a symmetric variant of personalized PageRank. To leverage both distance and angular information we propose a method of transforming normal graph neural networks into directional MPNNs. We show that with this transformation we can reduce the error of a normal graph neural network by 55% on the ZINC benchmark. We furthermore set the state of the art on ZINC and coordinate-free QM9 by incorporating synthetic coordinates in the SMP and DimeNet++ models. Our implementation is available online.
    NarrationBot and InfoBot: A Hybrid System for Automated Video Description. (arXiv:2111.03994v2 [cs.HC] UPDATED)
    (2 min) Video accessibility is crucial for blind and low vision users for equitable engagements in education, employment, and entertainment. Despite the availability of professional and amateur services and tools, most human-generated descriptions are expensive and time consuming. Moreover, the rate of human-generated descriptions cannot match the speed of video production. To overcome the increasing gaps in video accessibility, we developed a hybrid system of two tools to 1) automatically generate descriptions for videos and 2) provide answers or additional descriptions in response to user queries on a video. Results from a mixed-methods study with 26 blind and low vision individuals show that our system significantly improved user comprehension and enjoyment of selected videos when both tools were used in tandem. In addition, participants reported no significant difference in their ability to understand videos when presented with autogenerated descriptions versus human-revised autogenerated descriptions. Our results demonstrate user enthusiasm about the developed system and its promise for providing customized access to videos. We discuss the limitations of the current work and provide recommendations for the future development of automated video description tools.
    Neural HMMs are all you need (for high-quality attention-free TTS). (arXiv:2108.13320v5 [eess.AS] UPDATED)
    (2 min) Neural sequence-to-sequence TTS has achieved significantly better output quality than statistical speech synthesis using HMMs. However, neural TTS is generally not probabilistic and the use of non-monotonic attention both increases training time and introduces "babbling" failure modes that are unacceptable in production. This paper demonstrates that the old and new paradigms can be combined to obtain the advantages of both worlds. In particular, we replace the attention in Tacotron 2 with an autoregressive left-right no-skip hidden Markov model defined by a neural network. This leads to an HMM-based neural TTS model with monotonic alignment, trained to maximise the full sequence likelihood without approximations. We discuss how to combine innovations from both classical and contemporary TTS for best results. The final system is smaller and simpler than Tacotron 2, and learns to speak with fewer iterations and less data, whilst achieving the same naturalness prior to the post-net. Our approach also allows easy control over speaking rate. Audio examples and code are available at https://shivammehta007.github.io/Neural-HMM/
    Pediatric Automatic Sleep Staging: Deep Learning Ensemble Improves Accuracy and Reduces Predictive Uncertainty. (arXiv:2108.10211v2 [eess.SP] UPDATED)
    (2 min) Despite the tremendous progress recently made towards automatic sleep staging in adults, it is currently unknown if the most advanced algorithms generalize to the pediatric population, which displays distinctive characteristics in overnight polysomnography (PSG). To answer the question, in this work, we conduct a large-scale comparative study on the state-of-the-art deep learning methods for pediatric automatic sleep staging. A selection of six different deep neural networks with diverging features are adopted to evaluate a sample of more than 1,200 children across a wide spectrum of obstructive sleep apnea (OSA) severity. Our experimental results show that the individual performance of automated pediatric sleep stagers when evaluated on new subjects is equivalent to the expert-level one reported on adults. Combining the six stagers into ensemble models further boosts the staging accuracy, reaching an overall accuracy of 87.7%, a Cohen's kappa of 0.837, and a macro F1-score of 84.2% in case of single-channel EEG when evaluated on new subjects. The performance is further improved when dual-channel EEG$\cdot$EOG are used, reaching an accuracy of 88.8%, a Cohen's kappa of 0.852, and a macro F1-score of 85.8%. At the same time, the ensemble models lead to reduced predictive uncertainty. The results also show that the studied algorithms and their ensembles are robust to concept drift when the training and test data were recorded 7-months apart and after clinical intervention. Detailed analyses further demonstrate "almost perfect" agreement between the automatic stagers to one another and their similar patterns on the staging errors.
    Experience Report: Deep Learning-based System Log Analysis for Anomaly Detection. (arXiv:2107.05908v2 [cs.SE] UPDATED)
    (2 min) Logs have been an imperative resource to ensure the reliability and continuity of many software systems, especially large-scale distributed systems. They faithfully record runtime information to facilitate system troubleshooting and behavior understanding. Due to the large scale and complexity of modern software systems, the volume of logs has reached an unprecedented level. Consequently, for log-based anomaly detection, conventional manual inspection methods or even traditional machine learning-based methods become impractical, which serve as a catalyst for the rapid development of deep learning-based solutions. However, there is currently a lack of rigorous comparison among the representative log-based anomaly detectors that resort to neural networks. Moreover, the re-implementation process demands non-trivial efforts, and bias can be easily introduced. To better understand the characteristics of different anomaly detectors, in this paper, we provide a comprehensive review and evaluation of five popular neural networks used by six state-of-the-art methods. Particularly, four of the selected methods are unsupervised, and the remaining two are supervised. These methods are evaluated with two publicly available log datasets, which contain nearly 16 million log messages and 0.4 million anomaly instances in total. We believe our work can serve as a basis in this field and contribute to future academic research and industrial applications.
    Towards Real-World Deployment of Reinforcement Learning for Traffic Signal Control. (arXiv:2103.16223v3 [cs.LG] UPDATED)
    (2 min) Sub-optimal control policies in intersection traffic signal controllers (TSC) contribute to congestion and lead to negative effects on human health and the environment. Reinforcement learning (RL) for traffic signal control is a promising approach to design better control policies and has attracted considerable research interest in recent years. However, most work done in this area used simplified simulation environments of traffic scenarios to train RL-based TSC. To deploy RL in real-world traffic systems, the gap between simplified simulation environments and real-world applications has to be closed. Therefore, we propose LemgoRL, a benchmark tool to train RL agents as TSC in a realistic simulation environment of Lemgo, a medium-sized town in Germany. In addition to the realistic simulation model, LemgoRL encompasses a traffic signal logic unit that ensures compliance with all regulatory and safety requirements. LemgoRL offers the same interface as the wellknown OpenAI gym toolkit to enable easy deployment in existing research work. To demonstrate the functionality and applicability of LemgoRL, we train a state-of-the-art Deep RL algorithm on a CPU cluster utilizing a framework for distributed and parallel RL and compare its performance with other methods. Our benchmark tool drives the development of RL algorithms towards real-world applications.
    KNODE-MPC: A Knowledge-based Data-driven Predictive Control Framework for Aerial Robots. (arXiv:2109.04821v3 [cs.RO] UPDATED)
    (2 min) In this work, we consider the problem of deriving and incorporating accurate dynamic models for model predictive control (MPC) with an application to quadrotor control. MPC relies on precise dynamic models to achieve the desired closed-loop performance. However, the presence of uncertainties in complex systems and the environments they operate in poses a challenge in obtaining sufficiently accurate representations of the system dynamics. In this work, we make use of a deep learning tool, knowledge-based neural ordinary differential equations (KNODE), to augment a model obtained from first principles. The resulting hybrid model encompasses both a nominal first-principle model and a neural network learnt from simulated or real-world experimental data. Using a quadrotor, we benchmark our hybrid model against a state-of-the-art Gaussian Process (GP) model and show that the hybrid model provides more accurate predictions of the quadrotor dynamics and is able to generalize beyond the training data. To improve closed-loop performance, the hybrid model is integrated into a novel MPC framework, known as KNODE-MPC. Results show that the integrated framework achieves 60.2% improvement in simulations and more than 21% in physical experiments, in terms of trajectory tracking performance.
    GemNet: Universal Directional Graph Neural Networks for Molecules. (arXiv:2106.08903v6 [physics.comp-ph] UPDATED)
    (2 min) Effectively predicting molecular interactions has the potential to accelerate molecular dynamics by multiple orders of magnitude and thus revolutionize chemical simulations. Graph neural networks (GNNs) have recently shown great successes for this task, overtaking classical methods based on fixed molecular kernels. However, they still appear very limited from a theoretical perspective, since regular GNNs cannot distinguish certain types of graphs. In this work we close this gap between theory and practice. We show that GNNs with directed edge embeddings and two-hop message passing are indeed universal approximators for predictions that are invariant to translation, and equivariant to permutation and rotation. We then leverage these insights and multiple structural improvements to propose the geometric message passing neural network (GemNet). We demonstrate the benefits of the proposed changes in multiple ablation studies. GemNet outperforms previous models on the COLL, MD17, and OC20 datasets by 34%, 41%, and 20%, respectively, and performs especially well on the most challenging molecules. Our implementation is available online.
    Meta-Learning Reliable Priors in the Function Space. (arXiv:2106.03195v2 [cs.LG] UPDATED)
    (2 min) When data are scarce meta-learning can improve a learner's accuracy by harnessing previous experience from related learning tasks. However, existing methods have unreliable uncertainty estimates which are often overconfident. Addressing these shortcomings, we introduce a novel meta-learning framework, called F-PACOH, that treats meta-learned priors as stochastic processes and performs meta-level regularization directly in the function space. This allows us to directly steer the probabilistic predictions of the meta-learner towards high epistemic uncertainty in regions of insufficient meta-training data and, thus, obtain well-calibrated uncertainty estimates. Finally, we showcase how our approach can be integrated with sequential decision making, where reliable uncertainty quantification is imperative. In our benchmark study on meta-learning for Bayesian Optimization (BO), F-PACOH significantly outperforms all other meta-learners and standard baselines.
    MED-TEX: Transferring and Explaining Knowledge with Less Data from Pretrained Medical Imaging Models. (arXiv:2008.02593v2 [cs.CV] UPDATED)
    (2 min) Deep learning methods usually require a large amount of training data and lack interpretability. In this paper, we propose a novel knowledge distillation and model interpretation framework for medical image classification that jointly solves the above two issues. Specifically, to address the data-hungry issue, a small student model is learned with less data by distilling knowledge from a cumbersome pretrained teacher model. To interpret the teacher model and assist the learning of the student, an explainer module is introduced to highlight the regions of an input that are important for the predictions of the teacher model. Furthermore, the joint framework is trained by a principled way derived from the information-theoretic perspective. Our framework outperforms on the knowledge distillation and model interpretation tasks compared to state-of-the-art methods on a fundus dataset.
    Do We Exploit all Information for Counterfactual Analysis? Benefits of Factor Models and Idiosyncratic Correction. (arXiv:2011.03996v3 [econ.EM] UPDATED)
    (2 min) Optimal pricing, i.e., determining the price level that maximizes profit or revenue of a given product, is a vital task for the retail industry. To select such a quantity, one needs first to estimate the price elasticity from the product demand. Regression methods usually fail to recover such elasticities due to confounding effects and price endogeneity. Therefore, randomized experiments are typically required. However, elasticities can be highly heterogeneous depending on the location of stores, for example. As the randomization frequently occurs at the municipal level, standard difference-in-differences methods may also fail. Possible solutions are based on methodologies to measure the effects of treatments on a single (or just a few) treated unit(s) based on counterfactuals constructed from artificial controls. For example, for each city in the treatment group, a counterfactual may be constructed from the untreated locations. In this paper, we apply a novel high-dimensional statistical method to measure the effects of price changes on daily sales from a major retailer in Brazil. The proposed methodology combines principal components (factors) and sparse regressions, resulting in a method called Factor-Adjusted Regularized Method for Treatment evaluation (\texttt{FarmTreat}). The data consist of daily sales and prices of five different products over more than 400 municipalities. The products considered belong to the \emph{sweet and candies} category and experiments have been conducted over the years of 2016 and 2017. Our results confirm the hypothesis of a high degree of heterogeneity yielding very different pricing strategies over distinct municipalities.
    FjORD: Fair and Accurate Federated Learning under heterogeneous targets with Ordered Dropout. (arXiv:2102.13451v5 [cs.LG] UPDATED)
    (2 min) Federated Learning (FL) has been gaining significant traction across different ML tasks, ranging from vision to keyboard predictions. In large-scale deployments, client heterogeneity is a fact and constitutes a primary problem for fairness, training performance and accuracy. Although significant efforts have been made into tackling statistical data heterogeneity, the diversity in the processing capabilities and network bandwidth of clients, termed as system heterogeneity, has remained largely unexplored. Current solutions either disregard a large portion of available devices or set a uniform limit on the model's capacity, restricted by the least capable participants. In this work, we introduce Ordered Dropout, a mechanism that achieves an ordered, nested representation of knowledge in deep neural networks (DNNs) and enables the extraction of lower footprint submodels without the need of retraining. We further show that for linear maps our Ordered Dropout is equivalent to SVD. We employ this technique, along with a self-distillation methodology, in the realm of FL in a framework called FjORD. FjORD alleviates the problem of client system heterogeneity by tailoring the model width to the client's capabilities. Extensive evaluation on both CNNs and RNNs across diverse modalities shows that FjORD consistently leads to significant performance gains over state-of-the-art baselines, while maintaining its nested structure.
    MICo: Improved representations via sampling-based state similarity for Markov decision processes. (arXiv:2106.08229v3 [cs.LG] UPDATED)
    (2 min) We present a new behavioural distance over the state space of a Markov decision process, and demonstrate the use of this distance as an effective means of shaping the learnt representations of deep reinforcement learning agents. While existing notions of state similarity are typically difficult to learn at scale due to high computational cost and lack of sample-based algorithms, our newly-proposed distance addresses both of these issues. In addition to providing detailed theoretical analysis, we provide empirical evidence that learning this distance alongside the value function yields structured and informative representations, including strong results on the Arcade Learning Environment benchmark.
    Learning to Maximize Speech Quality Directly Using MOS Prediction for Neural Text-to-Speech. (arXiv:2011.01174v4 [eess.AS] UPDATED)
    (2 min) Although recent neural text-to-speech (TTS) systems have achieved high-quality speech synthesis, there are cases where a TTS system generates low-quality speech, mainly caused by limited training data or information loss during knowledge distillation. Therefore, we propose a novel method to improve speech quality by training a TTS model under the supervision of perceptual loss, which measures the distance between the maximum possible speech quality score and the predicted one. We first pre-train a mean opinion score (MOS) prediction model and then train a TTS model to maximize the MOS of synthesized speech using the pre-trained MOS prediction model. The proposed method can be applied universally (i.e., regardless of the TTS model architecture or the cause of speech quality degradation) and efficiently (i.e., without increasing the inference time or model complexity). The evaluation results for the MOS and phone error rate demonstrate that our proposed approach improves previous models in terms of both naturalness and intelligibility.
    Multimodal Representations Learning Based on Mutual Information Maximization and Minimization and Identity Embedding for Multimodal Sentiment Analysis. (arXiv:2201.03969v1 [cs.LG])
    (2 min) Multimodal sentiment analysis (MSA) is a fundamental complex research problem due to the heterogeneity gap between different modalities and the ambiguity of human emotional expression. Although there have been many successful attempts to construct multimodal representations for MSA, there are still two challenges to be addressed: 1) A more robust multimodal representation needs to be constructed to bridge the heterogeneity gap and cope with the complex multimodal interactions, and 2) the contextual dynamics must be modeled effectively throughout the information flow. In this work, we propose a multimodal representation model based on Mutual information Maximization and Minimization and Identity Embedding (MMMIE). We combine mutual information maximization between modal pairs, and mutual information minimization between input data and corresponding features to mine the modal-invariant and task-related information. Furthermore, Identity Embedding is proposed to prompt the downstream network to perceive the contextual information. Experimental results on two public datasets demonstrate the effectiveness of the proposed model.
    STFL: A Temporal-Spatial Federated Learning Framework for Graph Neural Networks. (arXiv:2111.06750v2 [cs.LG] UPDATED)
    (2 min) We present a spatial-temporal federated learning framework for graph neural networks, namely STFL. The framework explores the underlying correlation of the input spatial-temporal data and transform it to both node features and adjacency matrix. The federated learning setting in the framework ensures data privacy while achieving a good model generalization. Experiments results on the sleep stage dataset, ISRUC_S3, illustrate the effectiveness of STFL on graph prediction tasks.
    rVAD: An Unsupervised Segment-Based Robust Voice Activity Detection Method. (arXiv:1906.03588v2 [cs.SD] UPDATED)
    (3 min) This paper presents an unsupervised segment-based method for robust voice activity detection (rVAD). The method consists of two passes of denoising followed by a voice activity detection (VAD) stage. In the first pass, high-energy segments in a speech signal are detected by using a posteriori signal-to-noise ratio (SNR) weighted energy difference and if no pitch is detected within a segment, the segment is considered as a high-energy noise segment and set to zero. In the second pass, the speech signal is denoised by a speech enhancement method, for which several methods are explored. Next, neighbouring frames with pitch are grouped together to form pitch segments, and based on speech statistics, the pitch segments are further extended from both ends in order to include both voiced and unvoiced sounds and likely non-speech parts as well. In the end, a posteriori SNR weighted energy difference is applied to the extended pitch segments of the denoised speech signal for detecting voice activity. We evaluate the VAD performance of the proposed method using two databases, RATS and Aurora-2, which contain a large variety of noise conditions. The rVAD method is further evaluated, in terms of speaker verification performance, on the RedDots 2016 challenge database and its noise-corrupted versions. Experiment results show that rVAD is compared favourably with a number of existing methods. In addition, we present a modified version of rVAD where computationally intensive pitch extraction is replaced by computationally efficient spectral flatness calculation. The modified version significantly reduces the computational complexity at the cost of moderately inferior VAD performance, which is an advantage when processing a large amount of data and running on low resource devices. The source code of rVAD is made publicly available.
    A general class of surrogate functions for stable and efficient reinforcement learning. (arXiv:2108.05828v2 [cs.LG] UPDATED)
    (2 min) Common policy gradient methods rely on the maximization of a sequence of surrogate functions. In recent years, many such surrogate functions have been proposed, most without strong theoretical guarantees, leading to algorithms such as TRPO, PPO or MPO. Rather than design yet another surrogate function, we instead propose a general framework (FMA-PG) based on functional mirror ascent that gives rise to an entire family of surrogate functions. We construct surrogate functions that enable policy improvement guarantees, a property not shared by most existing surrogate functions. Crucially, these guarantees hold regardless of the choice of policy parameterization. Moreover, a particular instantiation of FMA-PG recovers important implementation heuristics (e.g., using forward vs reverse KL divergence) resulting in a variant of TRPO with additional desirable properties. Via experiments on simple bandit problems, we evaluate the algorithms instantiated by FMA-PG. The proposed framework also suggests an improved variant of PPO, whose robustness and efficiency we empirically demonstrate on the MuJoCo suite.
    Bootstrapping Informative Graph Augmentation via A Meta Learning Approach. (arXiv:2201.03812v1 [cs.LG])
    (2 min) Recent works explore learning graph representations in a self-supervised manner. In graph contrastive learning, benchmark methods apply various graph augmentation approaches. However, most of the augmentation methods are non-learnable, which causes the issue of generating unbeneficial augmented graphs. Such augmentation may degenerate the representation ability of graph contrastive learning methods. Therefore, we motivate our method to generate augmented graph by a learnable graph augmenter, called MEta Graph Augmentation (MEGA). We then clarify that a "good" graph augmentation must have uniformity at the instance-level and informativeness at the feature-level. To this end, we propose a novel approach to learning a graph augmenter that can generate an augmentation with uniformity and informativeness. The objective of the graph augmenter is to promote our feature extraction network to learn a more discriminative feature representation, which motivates us to propose a meta-learning paradigm. Empirically, the experiments across multiple benchmark datasets demonstrate that MEGA outperforms the state-of-the-art methods in graph self-supervised learning tasks. Further experimental studies prove the effectiveness of different terms of MEGA.
    Causal Inference in medicine and in health policy, a summary. (arXiv:2105.04655v4 [cs.LG] UPDATED)
    (2 min) A data science task can be deemed as making sense of the data or testing a hypothesis about it. The conclusions inferred from data can greatly guide us to make informative decisions. Big data has enabled us to carry out countless prediction tasks in conjunction with machine learning, such as identifying high risk patients suffering from a certain disease and taking preventable measures. However, healthcare practitioners are not content with mere predictions - they are also interested in the cause-effect relation between input features and clinical outcomes. Understanding such relations will help doctors treat patients and reduce the risk effectively. Causality is typically identified by randomized controlled trials. Often such trials are not feasible when scientists and researchers turn to observational studies and attempt to draw inferences. However, observational studies may also be affected by selection and/or confounding biases that can result in wrong causal conclusions. In this chapter, we will try to highlight some of the drawbacks that may arise in traditional machine learning and statistical approaches to analyze the observational data, particularly in the healthcare data analytics domain. We will discuss causal inference and ways to discover the cause-effect from observational studies in healthcare domain. Moreover, we will demonstrate the applications of causal inference in tackling some common machine learning issues such as missing data and model transportability. Finally, we will discuss the possibility of integrating reinforcement learning with causality as a way to counter confounding bias.
    The Ridgelet Prior: A Covariance Function Approach to Prior Specification for Bayesian Neural Networks. (arXiv:2010.08488v4 [stat.ML] UPDATED)
    (2 min) Bayesian neural networks attempt to combine the strong predictive performance of neural networks with formal quantification of uncertainty associated with the predictive output in the Bayesian framework. However, it remains unclear how to endow the parameters of the network with a prior distribution that is meaningful when lifted into the output space of the network. A possible solution is proposed that enables the user to posit an appropriate Gaussian process covariance function for the task at hand. Our approach constructs a prior distribution for the parameters of the network, called a ridgelet prior, that approximates the posited Gaussian process in the output space of the network. In contrast to existing work on the connection between neural networks and Gaussian processes, our analysis is non-asymptotic, with finite sample-size error bounds provided. This establishes the universality property that a Bayesian neural network can approximate any Gaussian process whose covariance function is sufficiently regular. Our experimental assessment is limited to a proof-of-concept, where we demonstrate that the ridgelet prior can out-perform an unstructured prior on regression problems for which a suitable Gaussian process prior can be provided.
    Sentiment Analysis with Deep Learning Models: A Comparative Study on a Decade of Sinhala Language Facebook Data. (arXiv:2201.03941v1 [cs.CL])
    (2 min) The relationship between Facebook posts and the corresponding reaction feature is an interesting subject to explore and understand. To archive this end, we test state-of-the-art Sinhala sentiment analysis models against a data set containing a decade worth of Sinhala posts with millions of reactions. For the purpose of establishing benchmarks and with the goal of identifying the best model for Sinhala sentiment analysis, we also test, on the same data set configuration, other deep learning models catered for sentiment analysis. In this study we report that the 3 layer Bidirectional LSTM model achieves an F1 score of 84.58% for Sinhala sentiment analysis, surpassing the current state-of-the-art model; Capsule B, which only manages to get an F1 score of 82.04%. Further, since all the deep learning models show F1 scores above 75% we conclude that it is safe to claim that Facebook reactions are suitable to predict the sentiment of a text.
    RNNs on Monitoring Physical Activity Energy Expenditure in Older People. (arXiv:2006.01169v2 [eess.SP] UPDATED)
    (3 min) Through the quantification of physical activity energy expenditure (PAEE), health care monitoring has the potential to stimulate vital and healthy ageing, inducing behavioural changes in older people and linking these to personal health gains. To be able to measure PAEE in a monitoring environment, methods from wearable accelerometers have been developed, however, mainly targeted towards younger people. Since elderly subjects differ in energy requirements and range of physical activities, the current models may not be suitable for estimating PAEE among the elderly. Because past activities influence present PAEE, we propose a modeling approach known for its ability to model sequential data, the Recurrent Neural Network (RNN). To train the RNN for an elderly population, we used the GOTOV dataset with 34 healthy participants of 60 years and older (mean 65 years old), performing 16 different activities. We used accelerometers placed on wrist and ankle, and measurements of energy counts by means of indirect calorimetry. After optimization, we propose an architecture consisting of an RNN with 3 GRU layers and a feedforward network combining both accelerometer and participant-level data. In this paper, we describe our efforts to go beyond the standard facilities of a GRU-based RNN, with the aim of achieving accuracy surpassing the state of the art. These efforts include switching aggregation function from mean to dispersion measures (SD, IQR, ...), combining temporal and static data (person-specific details such as age, weight, BMI) and adding symbolic activity data as predicted by a previously trained ML model. The resulting architecture manages to increase its performance by approximatelly 10% while decreasing training input by a factor of 10. It can thus be employed to investigate associations of PAEE with vitality parameters related to metabolic and cognitive health and mental well-being.
    Hybrid Car-Following Strategy based on Deep Deterministic Policy Gradient and Cooperative Adaptive Cruise Control. (arXiv:2103.03796v2 [cs.AI] UPDATED)
    (2 min) Deep deterministic policy gradient (DDPG)-based car-following strategy can break through the constraints of the differential equation model due to the ability of exploration on complex environments. However, the car-following performance of DDPG is usually degraded by unreasonable reward function design, insufficient training, and low sampling efficiency. In order to solve this kind of problem, a hybrid car-following strategy based on DDPG and cooperative adaptive cruise control (CACC) is proposed. First, the car-following process is modeled as the Markov decision process to calculate CACC and DDPG simultaneously at each frame. Given a current state, two actions are obtained from CACC and DDPG, respectively. Then, an optimal action, corresponding to the one offering a larger reward, is chosen as the output of the hybrid strategy. Meanwhile, a rule is designed to ensure that the change rate of acceleration is smaller than the desired value. Therefore, the proposed strategy not only guarantees the basic performance of car-following through CACC but also makes full use of the advantages of exploration on complex environments via DDPG. Finally, simulation results show that the car-following performance of the proposed strategy is improved compared with that of DDPG and CACC.
    Systematic Literature Review: Quantum Machine Learning and its applications. (arXiv:2201.04093v1 [quant-ph])
    (2 min) Quantum computing is the process of performing calculations using quantum mechanics. This field studies the quantum behavior of certain subatomic particles for subsequent use in performing calculations, as well as for large-scale information processing. These capabilities can give quantum computers an advantage in terms of computational time and cost over classical computers. Nowadays, there are scientific challenges that are impossible to perform by classical computation due to computational complexity or the time the calculation would take, and quantum computation is one of the possible answers. However, current quantum devices have not yet the necessary qubits and are not fault-tolerant enough to achieve these goals. Nonetheless, there are other fields like machine learning or chemistry where quantum computation could be useful with current quantum devices. This manuscript aims to present a Systematic Literature Review of the papers published between 2017 and 2021 to identify, analyze and classify the different algorithms used in quantum machine learning and their applications. Consequently, this study identified 52 articles that used quantum machine learning techniques and algorithms. The main types of found algorithms are quantum implementations of classical machine learning algorithms, such as support vector machines or the k-nearest neighbor model, and classical deep learning algorithms, like quantum neural networks. Many articles try to solve problems currently answered by classical machine learning but using quantum devices and algorithms. Even though results are promising, quantum machine learning is far from achieving its full potential. An improvement in the quantum hardware is required since the existing quantum computers lack enough quality, speed, and scale to allow quantum computing to achieve its full potential.
    DensE: An Enhanced Non-commutative Representation for Knowledge Graph Embedding with Adaptive Semantic Hierarchy. (arXiv:2008.04548v2 [cs.AI] UPDATED)
    (2 min) Capturing the composition patterns of relations is a vital task in knowledge graph completion. It also serves as a fundamental step towards multi-hop reasoning over learned knowledge. Previously, several rotation-based translational methods have been developed to model composite relations using the product of a series of complex-valued diagonal matrices. However, these methods tend to make several oversimplified assumptions on the composite relations, e.g., forcing them to be commutative, independent from entities and lacking semantic hierarchy. To systematically tackle these problems, we have developed a novel knowledge graph embedding method, named DensE, to provide an improved modeling scheme for the complex composition patterns of relations. In particular, our method decomposes each relation into an SO(3) group-based rotation operator and a scaling operator in the three dimensional (3-D) Euclidean space. This design principle leads to several advantages of our method: (1) For composite relations, the corresponding diagonal relation matrices can be non-commutative, reflecting a predominant scenario in real world applications; (2) Our model preserves the natural interaction between relational operations and entity embeddings; (3) The scaling operation provides the modeling power for the intrinsic semantic hierarchical structure of entities; (4) The enhanced expressiveness of DensE is achieved with high computational efficiency in terms of both parameter size and training time; and (5) Modeling entities in Euclidean space instead of quaternion space keeps the direct geometrical interpretations of relational patterns. Experimental results on multiple benchmark knowledge graphs show that DensE outperforms the current state-of-the-art models for missing link prediction, especially on composite relations.
    Designing ECG Monitoring Healthcare System with Federated Transfer Learning and Explainable AI. (arXiv:2105.12497v2 [cs.LG] UPDATED)
    (2 min) Deep learning play a vital role in classifying different arrhythmias using the electrocardiography (ECG) data. Nevertheless, training deep learning models normally requires a large amount of data and it can lead to privacy concerns. Unfortunately, a large amount of healthcare data cannot be easily collected from a single silo. Additionally, deep learning models are like black-box, with no explainability of the predicted results, which is often required in clinical healthcare. This limits the application of deep learning in real-world health systems. In this paper, we design a new explainable artificial intelligence (XAI) based deep learning framework in a federated setting for ECG-based healthcare applications. The federated setting is used to solve issues such as data availability and privacy concerns. Furthermore, the proposed framework setting effectively classifies arrhythmia's using an autoencoder and a classifier, both based on a convolutional neural network (CNN). Additionally, we propose an XAI-based module on top of the proposed classifier to explain the classification results, which help clinical practitioners make quick and reliable decisions. The proposed framework was trained and tested using the MIT-BIH Arrhythmia database. The classifier achieved accuracy up to 94% and 98% for arrhythmia detection using noisy and clean data, respectively, with five-fold cross-validation.
    Emotion Intensity and its Control for Emotional Voice Conversion. (arXiv:2201.03967v1 [cs.SD])
    (2 min) Emotional voice conversion (EVC) seeks to convert the emotional state of an utterance while preserving the linguistic content and speaker identity. In EVC, emotions are usually treated as discrete categories overlooking the fact that speech also conveys emotions with various intensity levels that the listener can perceive. In this paper, we aim to explicitly characterize and control the intensity of emotion. We propose to disentangle the speaker style from linguistic content and encode the speaker style into a style embedding in a continuous space that forms the prototype of emotion embedding. We further learn the actual emotion encoder from an emotion-labelled database and study the use of relative attributes to represent fine-grained emotion intensity. To ensure emotional intelligibility, we incorporate emotion classification loss and emotion embedding similarity loss into the training of the EVC network. As desired, the proposed network controls the fine-grained emotion intensity in the output speech. Through both objective and subjective evaluations, we validate the effectiveness of the proposed network for emotional expressiveness and emotion intensity control.
    quantum Case-Based Reasoning (qCBR). (arXiv:2104.00409v2 [cs.AI] UPDATED)
    (2 min) Case-Based Reasoning (CBR) is an artificial intelligence approach to problem-solving with a good record of success. This article proposes using Quantum Computing to improve some of the key processes of CBR, such that a Quantum Case-Based Reasoning (qCBR) paradigm can be defined. The focus is set on designing and implementing a qCBR based on the variational principle that improves its classical counterpart in terms of average accuracy, scalability and tolerance to overlapping. A comparative study of the proposed qCBR with a classic CBR is performed for the case of the Social Workers' Problem as a sample of a combinatorial optimization problem with overlapping. The algorithm's quantum feasibility is modelled with docplex and tested on IBMQ computers, and experimented on the Qibo framework.
    Representation Ensembling for Synergistic Lifelong Learning with Quasilinear Complexity. (arXiv:2004.12908v12 [cs.AI] UPDATED)
    (3 min) In biological learning, data are used to improve performance not only on the current task, but also on previously encountered, and as yet unencountered tasks. In contrast, classical machine learning starts from a blank slate, or tabula rasa, using data only for the single task at hand. While typical transfer learning algorithms can improve performance on future tasks, their performance on prior tasks degrades upon learning new tasks (called catastrophic forgetting). Many recent approaches for continual or lifelong learning have attempted to maintain performance given new tasks. But striving to avoid forgetting sets the goal unnecessarily low: the goal of lifelong learning, whether biological or artificial, should be to improve performance on both past and future tasks with any new data. Our key insight is that we can ensemble representations learned independently across tasks to transfer omnidirectionally, that is, jointly improve performance on both future unforeseen tasks (forward transfer) and past tasks (backward transfer). Specifically, we propose two lifelong learners: one ensembling trees and the other ensembling networks. In both cases, the algorithms are able to learn synergistically, improving performance on both past and future tasks in a variety of simulated and real data scenarios, including tabular data, image data, spoken data, and adversarial tasks. Moreover, they can do so with quasilinear space and time complexity, a requirement for any bona fide lifelong learning system.
    Data transformation based optimized customer churn prediction model for the telecommunication industry. (arXiv:2201.04088v1 [cs.LG])
    (2 min) Data transformation (DT) is a process that transfers the original data into a form which supports a particular classification algorithm and helps to analyze the data for a special purpose. To improve the prediction performance we investigated various data transform methods. This study is conducted in a customer churn prediction (CCP) context in the telecommunication industry (TCI), where customer attrition is a common phenomenon. We have proposed a novel approach of combining data transformation methods with the machine learning models for the CCP problem. We conducted our experiments on publicly available TCI datasets and assessed the performance in terms of the widely used evaluation measures (e.g. AUC, precision, recall, and F-measure). In this study, we presented comprehensive comparisons to affirm the effect of the transformation methods. The comparison results and statistical test proved that most of the proposed data transformation based optimized models improve the performance of CCP significantly. Overall, an efficient and optimized CCP model for the telecommunication industry has been presented through this manuscript.
    Spectrum Surveying: Active Radio Map Estimation with Autonomous UAVs. (arXiv:2201.04125v1 [eess.SP])
    (2 min) Radio maps find numerous applications in wireless communications and mobile robotics tasks, including resource allocation, interference coordination, and mission planning. Although numerous techniques have been proposed to construct radio maps from spatially distributed measurements, the locations of such measurements are assumed predetermined beforehand. In contrast, this paper proposes spectrum surveying, where a mobile robot such as an unmanned aerial vehicle (UAV) collects measurements at a set of locations that are actively selected to obtain high-quality map estimates in a short surveying time. This is performed in two steps. First, two novel algorithms, a model-based online Bayesian estimator and a data-driven deep learning algorithm, are devised for updating a map estimate and an uncertainty metric that indicates the informativeness of measurements at each possible location. These algorithms offer complementary benefits and feature constant complexity per measurement. Second, the uncertainty metric is used to plan the trajectory of the UAV to gather measurements at the most informative locations. To overcome the combinatorial complexity of this problem, a dynamic programming approach is proposed to obtain lists of waypoints through areas of large uncertainty in linear time. Numerical experiments conducted on a realistic dataset confirm that the proposed scheme constructs accurate radio maps quickly.
    Deep Neural Network Approximation For H\"older Functions. (arXiv:2201.03747v1 [cs.LG])
    (2 min) In this work, we explore the approximation capability of deep Rectified Quadratic Unit neural networks for H\"older-regular functions, with respect to the uniform norm. We find that theoretical approximation heavily depends on the selected activation function in the neural network.
    M2: Mixed Models with Preferences, Popularities and Transitions for Next-Basket Recommendation. (arXiv:2004.01646v3 [cs.LG] UPDATED)
    (2 min) Next-basket recommendation considers the problem of recommending a set of items into the next basket that users will purchase as a whole. In this paper, we develop a novel mixed model with preferences, popularities and transitions (M2) for the next-basket recommendation. This method models three important factors in next-basket generation process: 1) users' general preferences, 2) items' global popularities and 3) transition patterns among items. Unlike existing recurrent neural network-based approaches, M2 does not use the complicated networks to model the transitions among items, or generate embeddings for users. Instead, it has a simple encoder-decoder based approach (ed-Trans) to better model the transition patterns among items. We compared M2 with different combinations of the factors with 5 state-of-the-art next-basket recommendation methods on 4 public benchmark datasets in recommending the first, second and third next basket. Our experimental results demonstrate that M2 significantly outperforms the state-of-the-art methods on all the datasets in all the tasks, with an improvement of up to 22.1%. In addition, our ablation study demonstrates that the ed-Trans is more effective than recurrent neural networks in terms of the recommendation performance. We also have a thorough discussion on various experimental protocols and evaluation metrics for next-basket recommendation evaluation.
    DynO: Dynamic Onloading of Deep Neural Networks from Cloud to Device. (arXiv:2104.09949v2 [cs.DC] UPDATED)
    (2 min) Recently, there has been an explosive growth of mobile and embedded applications using convolutional neural networks(CNNs). To alleviate their excessive computational demands, developers have traditionally resorted to cloud offloading, inducing high infrastructure costs and a strong dependence on networking conditions. On the other end, the emergence of powerful SoCs is gradually enabling on-device execution. Nonetheless, low- and mid-tier platforms still struggle to run state-of-the-art CNNs sufficiently. In this paper, we present DynO, a distributed inference framework that combines the best of both worlds to address several challenges, such as device heterogeneity, varying bandwidth and multi-objective requirements. Key components that enable this are its novel CNN-specific data packing method, which exploits the variability of precision needs in different parts of the CNN when onloading computation, and its novel scheduler that jointly tunes the partition point and transferred data precision at run time to adapt inference to its execution environment. Quantitative evaluation shows that DynO outperforms the current state-of-the-art, improving throughput by over an order of magnitude over device-only execution and up to 7.9x over competing CNN offloading systems, with up to 60x less data transferred.
    Learning to Denoise Raw Mobile UI Layouts for ImprovingDatasets at Scale. (arXiv:2201.04100v1 [cs.HC])
    (2 min) The layout of a mobile screen is a critical data source for UI designresearch and semantic understanding of the screen. However, UIlayouts in existing datasets are often noisy, have mismatches withtheir visual representation, or consists of generic or app-specifictypes that are difficult to analyze and model. In this paper, wepropose the CLAY pipeline that uses a deep learning approach fordenoising UI layouts, allowing us to automatically improve existingmobile UI layout datasets at scale. Our pipeline takes both thescreenshot and the raw UI layout, and annotates the raw layout byremoving incorrect nodes and assigning a semantically meaningfultype to each node. To experiment with our data-cleaning pipeline,we create the CLAY dataset of 59,555 human-annotated screenlayouts, based on screenshots and raw layouts from Rico, a publicmobile UI corpus. Our deep models achieve high accuracy withF1 scores of 82.7% for detecting layout objects that do not have avalid visual representation and 85.9% for recognizing object types,which significantly outperforms a heuristic baseline. Our work laysa foundation for creating large-scale high quality UI layout datasetsfor data-driven mobile UI research and reduces the need of manuallabeling efforts that are prohibitively expensive.
    Automated Reinforcement Learning (AutoRL): A Survey and Open Problems. (arXiv:2201.03916v1 [cs.LG])
    (2 min) The combination of Reinforcement Learning (RL) with deep learning has led to a series of impressive feats, with many believing (deep) RL provides a path towards generally capable agents. However, the success of RL agents is often highly sensitive to design choices in the training process, which may require tedious and error-prone manual tuning. This makes it challenging to use RL for new problems, while also limits its full potential. In many other areas of machine learning, AutoML has shown it is possible to automate such design choices and has also yielded promising initial results when applied to RL. However, Automated Reinforcement Learning (AutoRL) involves not only standard applications of AutoML but also includes additional challenges unique to RL, that naturally produce a different set of methods. As such, AutoRL has been emerging as an important area of research in RL, providing promise in a variety of applications from RNA design to playing games such as Go. Given the diversity of methods and environments considered in RL, much of the research has been conducted in distinct subfields, ranging from meta-learning to evolution. In this survey we seek to unify the field of AutoRL, we provide a common taxonomy, discuss each area in detail and pose open problems which would be of interest to researchers going forward.
    An Embedding Framework for Consistent Polyhedral Surrogates. (arXiv:1907.07330v2 [cs.LG] UPDATED)
    (2 min) We formalize and study the natural approach of designing convex surrogate loss functions via embeddings, for problems such as classification, ranking, or structured prediction. In this approach, one embeds each of the finitely many predictions (e.g.\ rankings) as a point in $\mathbb{R}^d$, assigns the original loss values to these points, and "convexifies" the loss in some way to obtain a surrogate. We establish a strong connection between this approach and polyhedral (piecewise-linear convex) surrogate losses. Given any polyhedral loss $L$, we give a construction of a link function through which $L$ is a consistent surrogate for the loss it embeds. Conversely, we show how to construct a consistent polyhedral surrogate for any given discrete loss. Our framework yields succinct proofs of consistency or inconsistency of various polyhedral surrogates in the literature, and for inconsistent surrogates, it further reveals the discrete losses for which these surrogates are consistent. We show some additional structure of embeddings, such as the equivalence of embedding and matching Bayes risks, and the equivalence of various notions of non-redudancy. Using these results, we establish that indirect elicitation, a necessary condition for consistency, is also sufficient when working with polyhedral surrogates.
    ExBrainable: An Open-Source GUI for CNN-based EEG Decoding and Model Interpretation. (arXiv:2201.04065v1 [eess.SP])
    (2 min) We have developed a graphic user interface (GUI), ExBrainable, dedicated to convolutional neural networks (CNN) model training and visualization in electroencephalography (EEG) decoding. Available functions include model training, evaluation, and parameter visualization in terms of temporal and spatial representations. We demonstrate these functions using a well-studied public dataset of motor-imagery EEG and compare the results with existing knowledge of neuroscience. The primary objective of ExBrainable is to provide a fast, simplified, and user-friendly solution of EEG decoding for investigators across disciplines to leverage cutting-edge methods in brain/neuroscience research.
    On the Efficacy of Co-Attention Transformer Layers in Visual Question Answering. (arXiv:2201.03965v1 [cs.CV])
    (2 min) In recent years, multi-modal transformers have shown significant progress in Vision-Language tasks, such as Visual Question Answering (VQA), outperforming previous architectures by a considerable margin. This improvement in VQA is often attributed to the rich interactions between vision and language streams. In this work, we investigate the efficacy of co-attention transformer layers in helping the network focus on relevant regions while answering the question. We generate visual attention maps using the question-conditioned image attention scores in these co-attention layers. We evaluate the effect of the following critical components on visual attention of a state-of-the-art VQA model: (i) number of object region proposals, (ii) question part of speech (POS) tags, (iii) question semantics, (iv) number of co-attention layers, and (v) answer accuracy. We compare the neural network attention maps against human attention maps both qualitatively and quantitatively. Our findings indicate that co-attention transformer modules are crucial in attending to relevant regions of the image given a question. Importantly, we observe that the semantic meaning of the question is not what drives visual attention, but specific keywords in the question do. Our work sheds light on the function and interpretation of co-attention transformer layers, highlights gaps in current networks, and can guide the development of future VQA models and networks that simultaneously process visual and language streams.
    In Defense of the Unitary Scalarization for Deep Multi-Task Learning. (arXiv:2201.04122v1 [cs.LG])
    (2 min) Recent multi-task learning research argues against unitary scalarization, where training simply minimizes the sum of the task losses. Several ad-hoc multi-task optimization algorithms have instead been proposed, inspired by various hypotheses about what makes multi-task settings difficult. The majority of these optimizers require per-task gradients, and introduce significant memory, runtime, and implementation overhead. We present a theoretical analysis suggesting that many specialized multi-task optimizers can be interpreted as forms of regularization. Moreover, we show that, when coupled with standard regularization and stabilization techniques from single-task learning, unitary scalarization matches or improves upon the performance of complex multi-task optimizers in both supervised and reinforcement learning settings. We believe our results call for a critical reevaluation of recent research in the area.
    Application of Common Spatial Patterns in Gravitational Waves Detection. (arXiv:2201.04086v1 [gr-qc])
    (2 min) Common Spatial Patterns (CSP) is a feature extraction algorithm widely used in Brain-Computer Interface (BCI) Systems for detecting Event-Related Potentials (ERPs) in multi-channel magneto/electroencephalography (MEG/EEG) time series data. In this article, we develop and apply a CSP algorithm to the problem of identifying whether a given epoch of multi-detector Gravitational Wave (GW) strains contains coalescenses. Paired with Signal Processing techniques and a Logistic Regression classifier, we find that our pipeline is correctly able to detect 76 out of 82 confident events from Gravitational Wave Transient Catalog, using H1 and L1 strains, with a classification score of $93.72 \pm 0.04\%$ using $10 \times 5$ cross validation. The false negative events were: GW170817-v3, GW191219 163120-v1, GW200115 042309-v2, GW200210 092254-v1, GW200220 061928-v1, and GW200322 091133-v1.
    Improving ECG Classification Interpretability using Saliency Maps. (arXiv:2201.04070v1 [eess.SP])
    (2 min) Cardiovascular disease is a large worldwide healthcare issue; symptoms often present suddenly with minimal warning. The electrocardiogram (ECG) is a fast, simple and reliable method of evaluating the health of the heart, by measuring electrical activity recorded through electrodes placed on the skin. ECGs often need to be analyzed by a cardiologist, taking time which could be spent on improving patient care and outcomes. Because of this, automatic ECG classification systems using machine learning have been proposed, which can learn complex interactions between ECG features and use this to detect abnormalities. However, algorithms built for this purpose often fail to generalize well to unseen data, reporting initially impressive results which drop dramatically when applied to new environments. Additionally, machine learning algorithms suffer a "black-box" issue, in which it is difficult to determine how a decision has been made. This is vital for applications in healthcare, as clinicians need to be able to verify the process of evaluation in order to trust the algorithm. This paper proposes a method for visualizing model decisions across each class in the MIT-BIH arrhythmia dataset, using adapted saliency maps averaged across complete classes to determine what patterns are being learned. We do this by building two algorithms based on state-of-the-art models. This paper highlights how these maps can be used to find problems in the model which could be affecting generalizability and model performance. Comparing saliency maps across complete classes gives an overall impression of confounding variables or other biases in the model, unlike what would be highlighted when comparing saliency maps on an ECG-by-ECG basis.
    Image quality measurements and denoising using Fourier Ring Correlations. (arXiv:2201.03992v1 [eess.IV])
    (2 min) Image quality is a nebulous concept with different meanings to different people. To quantify image quality a relative difference is typically calculated between a corrupted image and a ground truth image. But what metric should we use for measuring this difference? Ideally, the metric should perform well for both natural and scientific images. The structural similarity index (SSIM) is a good measure for how humans perceive image similarities, but is not sensitive to differences that are scientifically meaningful in microscopy. In electron and super-resolution microscopy, the Fourier Ring Correlation (FRC) is often used, but is little known outside of these fields. Here we show that the FRC can equally well be applied to natural images, e.g. the Google Open Images dataset. We then define a loss function based on the FRC, show that it is analytically differentiable, and use it to train a U-net for denoising of images. This FRC-based loss function allows the network to train faster and achieve similar or better results than when using L1- or L2- based losses. We also investigate the properties and limitations of neural network denoising with the FRC analysis.
    A hybrid estimation of distribution algorithm for joint stratification and sample allocation. (arXiv:2201.04068v1 [stat.ME])
    (2 min) In this study we propose a hybrid estimation of distribution algorithm (HEDA) to solve the joint stratification and sample allocation problem. This is a complex problem in which each the quality of each stratification from the set of all possible stratifications is measured its optimal sample allocation. EDAs are stochastic black-box optimization algorithms which can be used to estimate, build and sample probability models in the search for an optimal stratification. In this paper we enhance the exploitation properties of the EDA by adding a simulated annealing algorithm to make it a hybrid EDA. Results of empirical comparisons for atomic and continuous strata show that the HEDA attains the bests results found so far when compared to benchmark tests on the same data using a grouping genetic algorithm, simulated annealing algorithm or hill-climbing algorithm. However, the execution times and total execution are, in general, higher for the HEDA.
    Differentially Describing Groups of Graphs. (arXiv:2201.04064v1 [cs.SI])
    (2 min) How does neural connectivity in autistic children differ from neural connectivity in healthy children or autistic youths? What patterns in global trade networks are shared across classes of goods, and how do these patterns change over time? Answering questions like these requires us to differentially describe groups of graphs: Given a set of graphs and a partition of these graphs into groups, discover what graphs in one group have in common, how they systematically differ from graphs in other groups, and how multiple groups of graphs are related. We refer to this task as graph group analysis, which seeks to describe similarities and differences between graph groups by means of statistically significant subgraphs. To perform graph group analysis, we introduce Gragra, which uses maximum entropy modeling to identify a non-redundant set of subgraphs with statistically significant associations to one or more graph groups. Through an extensive set of experiments on a wide range of synthetic and real-world graph groups, we confirm that Gragra works well in practice.
    A novel method for error analysis in radiation thermometry with application to industrial furnaces. (arXiv:2201.04069v1 [eess.SP])
    (2 min) Accurate temperature measurements are essential for the proper monitoring and control of industrial furnaces. However, measurement uncertainty is a risk for such a critical parameter. Certain instrumental and environmental errors must be considered when using spectral-band radiation thermometry techniques, such as the uncertainty in the emissivity of the target surface, reflected radiation from surrounding objects, or atmospheric absorption and emission, to name a few. Undesired contributions to measured radiation can be isolated using measurement models, also known as error-correction models. This paper presents a methodology for budgeting significant sources of error and uncertainty during temperature measurements in a petrochemical furnace scenario. A continuous monitoring system is also presented, aided by a deep-learning-based measurement correction model, to allow domain experts to analyze the furnace's operation in real-time. To validate the proposed system's functionality, a real-world application case in a petrochemical plant is presented. The proposed solution demonstrates the viability of precise industrial furnace monitoring, thereby increasing operational security and improving the efficiency of such energy-intensive systems.
    Training-Free Uncertainty Estimation for Dense Regression: Sensitivity as a Surrogate. (arXiv:1910.04858v3 [cs.CV] UPDATED)
    (2 min) Uncertainty estimation is an essential step in the evaluation of the robustness for deep learning models in computer vision, especially when applied in risk-sensitive areas. However, most state-of-the-art deep learning models either fail to obtain uncertainty estimation or need significant modification (e.g., formulating a proper Bayesian treatment) to obtain it. Most previous methods are not able to take an arbitrary model off the shelf and generate uncertainty estimation without retraining or redesigning it. To address this gap, we perform a systematic exploration into training-free uncertainty estimation for dense regression, an unrecognized yet important problem, and provide a theoretical construction justifying such estimations. We propose three simple and scalable methods to analyze the variance of outputs from a trained network under tolerable perturbations: infer-transformation, infer-noise, and infer-dropout. They operate solely during the inference, without the need to re-train, re-design, or fine-tune the models, as typically required by state-of-the-art uncertainty estimation methods. Surprisingly, even without involving such perturbations in training, our methods produce comparable or even better uncertainty estimation when compared to training-required state-of-the-art methods.
    End-To-End Optimization of LiDAR Beam Configuration for 3D Object Detection and Localization. (arXiv:2201.03860v1 [cs.RO])
    (2 min) Existing learning methods for LiDAR-based applications use 3D points scanned under a pre-determined beam configuration, e.g., the elevation angles of beams are often evenly distributed. Those fixed configurations are task-agnostic, so simply using them can lead to sub-optimal performance. In this work, we take a new route to learn to optimize the LiDAR beam configuration for a given application. Specifically, we propose a reinforcement learning-based learning-to-optimize (RL-L2O) framework to automatically optimize the beam configuration in an end-to-end manner for different LiDAR-based applications. The optimization is guided by the final performance of the target task and thus our method can be integrated easily with any LiDAR-based application as a simple drop-in module. The method is especially useful when a low-resolution (low-cost) LiDAR is needed, for instance, for system deployment at a massive scale. We use our method to search for the beam configuration of a low-resolution LiDAR for two important tasks: 3D object detection and localization. Experiments show that the proposed RL-L2O method improves the performance in both tasks significantly compared to the baseline methods. We believe that a combination of our method with the recent advances of programmable LiDARs can start a new research direction for LiDAR-based active perception. The code is publicly available at https://github.com/vniclas/lidar_beam_selection
    Reward Relabelling for combined Reinforcement and Imitation Learning on sparse-reward tasks. (arXiv:2201.03834v1 [cs.LG])
    (2 min) During recent years, deep reinforcement learning (DRL) has made successful incursions into complex decision-making applications such as robotics, autonomous driving or video games. In the search for more sample-efficient algorithms, a promising direction is to leverage as much external off-policy data as possible. One staple of this data-driven approach is to learn from expert demonstrations. In the past, multiple ideas have been proposed to make good use of the demonstrations added to the replay buffer, such as pretraining on demonstrations only or minimizing additional cost functions. We present a new method, able to leverage demonstrations and episodes collected online in any sparse-reward environment with any off-policy algorithm. Our method is based on a reward bonus given to demonstrations and successful episodes, encouraging expert imitation and self-imitation. First, we give a reward bonus to the transitions coming from demonstrations to encourage the agent to match the demonstrated behaviour. Then, upon collecting a successful episode, we relabel its transitions with the same bonus before adding them to the replay buffer, encouraging the agent to also match its previous successes. Our experiments focus on manipulation robotics, specifically on three tasks for a 6 degrees-of-freedom robotic arm in simulation. We show that our method based on reward relabeling improves the performance of the base algorithm (SAC and DDPG) on these tasks, even in the absence of demonstrations. Furthermore, integrating into our method two improvements from previous works allows our approach to outperform all baselines.
    DDG-DA: Data Distribution Generation for Predictable Concept Drift Adaptation. (arXiv:2201.04038v1 [cs.LG])
    (2 min) In many real-world scenarios, we often deal with streaming data that is sequentially collected over time. Due to the non-stationary nature of the environment, the streaming data distribution may change in unpredictable ways, which is known as concept drift. To handle concept drift, previous methods first detect when/where the concept drift happens and then adapt models to fit the distribution of the latest data. However, there are still many cases that some underlying factors of environment evolution are predictable, making it possible to model the future concept drift trend of the streaming data, while such cases are not fully explored in previous work. In this paper, we propose a novel method DDG-DA, that can effectively forecast the evolution of data distribution and improve the performance of models. Specifically, we first train a predictor to estimate the future data distribution, then leverage it to generate training samples, and finally train models on the generated data. We conduct experiments on three real-world tasks (forecasting on stock price trend, electricity load and solar irradiance) and obtain significant improvement on multiple widely-used models.
    State Estimation in Electric Power Systems Leveraging Graph Neural Networks. (arXiv:2201.04056v1 [cs.LG])
    (2 min) The goal of the state estimation (SE) algorithm is to estimate complex bus voltages as state variables based on the available set of measurements in the power system. Because phasor measurement units (PMUs) are increasingly being used in transmission power systems, there is a need for a fast SE solver that can take advantage of PMU high sampling rates. This paper proposes training a graph neural network (GNN) to learn the estimates given the PMU voltage and current measurements as inputs, with the intent of obtaining fast and accurate predictions during the evaluation phase. GNN is trained using synthetic datasets, created by randomly sampling sets of measurements in the power system and labelling them with a solution obtained using a linear SE with PMUs solver. The presented results display the accuracy of GNN predictions in various test scenarios and tackle the sensitivity of the predictions to the missing input data.
    An Introduction to Autoencoders. (arXiv:2201.03898v1 [cs.LG])
    (2 min) In this article, we will look at autoencoders. This article covers the mathematics and the fundamental concepts of autoencoders. We will discuss what they are, what the limitations are, the typical use cases, and we will look at some examples. We will start with a general introduction to autoencoders, and we will discuss the role of the activation function in the output layer and the loss function. We will then discuss what the reconstruction error is. Finally, we will look at typical applications as dimensionality reduction, classification, denoising, and anomaly detection. This paper contains the notes of a PhD-level lecture on autoencoders given in 2021.
    Verified Probabilistic Policies for Deep Reinforcement Learning. (arXiv:2201.03698v1 [cs.AI])
    (2 min) Deep reinforcement learning is an increasingly popular technique for synthesising policies to control an agent's interaction with its environment. There is also growing interest in formally verifying that such policies are correct and execute safely. Progress has been made in this area by building on existing work for verification of deep neural networks and of continuous-state dynamical systems. In this paper, we tackle the problem of verifying probabilistic policies for deep reinforcement learning, which are used to, for example, tackle adversarial environments, break symmetries and manage trade-offs. We propose an abstraction approach, based on interval Markov decision processes, that yields probabilistic guarantees on a policy's execution, and present techniques to build and solve these models using abstract interpretation, mixed-integer linear programming, entropy-based refinement and probabilistic model checking. We implement our approach and illustrate its effectiveness on a selection of reinforcement learning benchmarks.
    Atomistic Simulations for Reactions and Spectroscopy in the Era of Machine Learning -- Quo Vadis?. (arXiv:2201.03822v1 [physics.chem-ph])
    (2 min) Atomistic simulations using accurate energy functions can provide molecular-level insight into functional motions of molecules in the gas- and in the condensed phase. Together with recently developed and currently pursued efforts in integrating and combining this with machine learning techniques provides a unique opportunity to bring such dynamics simulations closer to reality. This perspective delineates the present status of the field from efforts of others in the field and some of your own work and discusses open questions and future prospects.
    A Likelihood Ratio based Domain Adaptation Method for E2E Models. (arXiv:2201.03655v1 [cs.CL])
    (2 min) End-to-end (E2E) automatic speech recognition models like Recurrent Neural Networks Transducer (RNN-T) are becoming a popular choice for streaming ASR applications like voice assistants. While E2E models are very effective at learning representation of the training data they are trained on, their accuracy on unseen domains remains a challenging problem. Additionally, these models require paired audio and text training data, are computationally expensive and are difficult to adapt towards the fast evolving nature of conversational speech. In this work, we explore a contextual biasing approach using likelihood-ratio that leverages text data sources to adapt RNN-T model to new domains and entities. We show that this method is effective in improving rare words recognition, and results in a relative improvement of 10% in 1-best word error rate (WER) and 10% in n-best Oracle WER (n=8) on multiple out-of-domain datasets without any degradation on a general dataset. We also show that complementing the contextual biasing adaptation with adaptation of a second-pass rescoring model gives additive WER improvements.
    Moshpit SGD: Communication-Efficient Decentralized Training on Heterogeneous Unreliable Devices. (arXiv:2103.03239v4 [cs.LG] UPDATED)
    (2 min) Training deep neural networks on large datasets can often be accelerated by using multiple compute nodes. This approach, known as distributed training, can utilize hundreds of computers via specialized message-passing protocols such as Ring All-Reduce. However, running these protocols at scale requires reliable high-speed networking that is only available in dedicated clusters. In contrast, many real-world applications, such as federated learning and cloud-based distributed training, operate on unreliable devices with unstable network bandwidth. As a result, these applications are restricted to using parameter servers or gossip-based averaging protocols. In this work, we lift that restriction by proposing Moshpit All-Reduce - an iterative averaging protocol that exponentially converges to the global average. We demonstrate the efficiency of our protocol for distributed optimization with strong theoretical guarantees. The experiments show 1.3x speedup for ResNet-50 training on ImageNet compared to competitive gossip-based strategies and 1.5x speedup when training ALBERT-large from scratch using preemptible compute nodes.
    Optimal and Differentially Private Data Acquisition: Central and Local Mechanisms. (arXiv:2201.03968v1 [cs.GT])
    (2 min) We consider a platform's problem of collecting data from privacy sensitive users to estimate an underlying parameter of interest. We formulate this question as a Bayesian-optimal mechanism design problem, in which an individual can share her (verifiable) data in exchange for a monetary reward or services, but at the same time has a (private) heterogeneous privacy cost which we quantify using differential privacy. We consider two popular differential privacy settings for providing privacy guarantees for the users: central and local. In both settings, we establish minimax lower bounds for the estimation error and derive (near) optimal estimators for given heterogeneous privacy loss levels for users. Building on this characterization, we pose the mechanism design problem as the optimal selection of an estimator and payments that will elicit truthful reporting of users' privacy sensitivities. Under a regularity condition on the distribution of privacy sensitivities we develop efficient algorithmic mechanisms to solve this problem in both privacy settings. Our mechanism in the central setting can be implemented in time $\mathcal{O}(n \log n)$ where $n$ is the number of users and our mechanism in the local setting admits a Polynomial Time Approximation Scheme (PTAS).
    FairEdit: Preserving Fairness in Graph Neural Networks through Greedy Graph Editing. (arXiv:2201.03681v1 [cs.LG])
    (2 min) Graph Neural Networks (GNNs) have proven to excel in predictive modeling tasks where the underlying data is a graph. However, as GNNs are extensively used in human-centered applications, the issue of fairness has arisen. While edge deletion is a common method used to promote fairness in GNNs, it fails to consider when data is inherently missing fair connections. In this work we consider the unexplored method of edge addition, accompanied by deletion, to promote fairness. We propose two model-agnostic algorithms to perform edge editing: a brute force approach and a continuous approximation approach, FairEdit. FairEdit performs efficient edge editing by leveraging gradient information of a fairness loss to find edges that improve fairness. We find that FairEdit outperforms standard training for many data sets and GNN methods, while performing comparably to many state-of-the-art methods, demonstrating FairEdit's ability to improve fairness across many domains and models.
    Optimally compressing VC classes. (arXiv:2201.04131v1 [cs.LG])
    (1 min) Resolving a conjecture of Littlestone and Warmuth, we show that any concept class of VC-dimension $d$ has a sample compression scheme of size $d$.
    The Dataset Nutrition Label (2nd Gen): Leveraging Context to Mitigate Harms in Artificial Intelligence. (arXiv:2201.03954v1 [cs.LG])
    (2 min) As the production of and reliance on datasets to produce automated decision-making systems (ADS) increases, so does the need for processes for evaluating and interrogating the underlying data. After launching the Dataset Nutrition Label in 2018, the Data Nutrition Project has made significant updates to the design and purpose of the Label, and is launching an updated Label in late 2020, which is previewed in this paper. The new Label includes context-specific Use Cases &Alerts presented through an updated design and user interface targeted towards the data scientist profile. This paper discusses the harm and bias from underlying training data that the Label is intended to mitigate, the current state of the work including new datasets being labeled, new and existing challenges, and further directions of the work, as well as Figures previewing the new label.
    VGAER: graph neural network reconstruction based community detection. (arXiv:2201.04066v1 [cs.SI])
    (2 min) Community detection is a fundamental and important issue in network science, but there are only a few community detection algorithms based on graph neural networks, among which unsupervised algorithms are almost blank. By fusing the high-order modularity information with network features, this paper proposes a Variational Graph AutoEncoder Reconstruction based community detection VGAER for the first time, and gives its non-probabilistic version. They do not need any prior information. We have carefully designed corresponding input features, decoder, and downstream tasks based on the community detection task and these designs are concise, natural, and perform well (NMI values under our design are improved by 59.1% - 565.9%). Based on a series of experiments with wide range of datasets and advanced methods, VGAER has achieved superior performance and shows strong competitiveness and potential with a simpler design. Finally, we report the results of algorithm convergence analysis and t-SNE visualization, which clearly depicted the stable performance and powerful network modularity ability of VGAER. Our codes are available at https://github.com/qcydm/VGAER.
    DANNTe: a case study of a turbo-machinery sensor virtualization under domain shift. (arXiv:2201.03850v1 [cs.LG])
    (2 min) We propose an adversarial learning method to tackle a Domain Adaptation (DA) time series regression task (DANNTe). The regression aims at building a virtual copy of a sensor installed on a gas turbine, to be used in place of the physical sensor which can be missing in certain situations. Our DA approach is to search for a domain-invariant representation of the features. The learner has access to both a labelled source dataset and an unlabeled target dataset (unsupervised DA) and is trained on both, exploiting the minmax game between a task regressor and a domain classifier Neural Networks. Both models share the same feature representation, learnt by a feature extractor. This work is based on the results published by Ganin et al. arXiv:1505.07818; indeed, we present an extension suitable to time series applications. We report a significant improvement in regression performance, compared to the baseline model trained on the source domain only.
    Winning solutions and post-challenge analyses of the ChaLearn AutoDL challenge 2019. (arXiv:2201.03801v1 [cs.LG])
    (2 min) This paper reports the results and post-challenge analyses of ChaLearn's AutoDL challenge series, which helped sorting out a profusion of AutoML solutions for Deep Learning (DL) that had been introduced in a variety of settings, but lacked fair comparisons. All input data modalities (time series, images, videos, text, tabular) were formatted as tensors and all tasks were multi-label classification problems. Code submissions were executed on hidden tasks, with limited time and computational resources, pushing solutions that get results quickly. In this setting, DL methods dominated, though popular Neural Architecture Search (NAS) was impractical. Solutions relied on fine-tuned pre-trained networks, with architectures matching data modality. Post-challenge tests did not reveal improvements beyond the imposed time limit. While no component is particularly original or novel, a high level modular organization emerged featuring a "meta-learner", "data ingestor", "model selector", "model/learner", and "evaluator". This modularity enabled ablation studies, which revealed the importance of (off-platform) meta-learning, ensembling, and efficient data management. Experiments on heterogeneous module combinations further confirm the (local) optimality of the winning solutions. Our challenge legacy includes an ever-lasting benchmark (this http URL), the open-sourced code of the winners, and a free "AutoDL self-service".
    Online Changepoint Detection on a Budget. (arXiv:2201.03710v1 [cs.LG])
    (2 min) Changepoints are abrupt variations in the underlying distribution of data. Detecting changes in a data stream is an important problem with many applications. In this paper, we are interested in changepoint detection algorithms which operate in an online setting in the sense that both its storage requirements and worst-case computational complexity per observation are independent of the number of previous observations. We propose an online changepoint detection algorithm for both univariate and multivariate data which compares favorably with offline changepoint detection algorithms while also operating in a strictly more constrained computational model. In addition, we present a simple online hyperparameter auto tuning technique for these algorithms.
    Feature Extraction Framework based on Contrastive Learning with Adaptive Positive and Negative Samples. (arXiv:2201.03942v1 [cs.LG])
    (2 min) In this study, we propose a feature extraction framework based on contrastive learning with adaptive positive and negative samples (CL-FEFA) that is suitable for unsupervised, supervised, and semi-supervised single-view feature extraction. CL-FEFA constructs adaptively the positive and negative samples from the results of feature extraction, which makes it more appropriate and accurate. Thereafter, the discriminative features are re extracted to according to InfoNCE loss based on previous positive and negative samples, which will make the intra-class samples more compact and the inter-class samples more dispersed. At the same time, using the potential structure information of subspace samples to dynamically construct positive and negative samples can make our framework more robust to noisy data. Furthermore, CL-FEFA considers the mutual information between positive samples, that is, similar samples in potential structures, which provides theoretical support for its advantages in feature extraction. The final numerical experiments prove that the proposed framework has a strong advantage over the traditional feature extraction methods and contrastive learning methods.
    PEPit: computer-assisted worst-case analyses of first-order optimization methods in Python. (arXiv:2201.04040v1 [math.OC])
    (2 min) PEPit is a Python package aiming at simplifying the access to worst-case analyses of a large family of first-order optimization methods possibly involving gradient, projection, proximal, or linear optimization oracles, along with their approximate, or Bregman variants. In short, PEPit is a package enabling computer-assisted worst-case analyses of first-order optimization methods. The key underlying idea is to cast the problem of performing a worst-case analysis, often referred to as a performance estimation problem (PEP), as a semidefinite program (SDP) which can be solved numerically. For doing that, the package users are only required to write first-order methods nearly as they would have implemented them. The package then takes care of the SDP modelling parts, and the worst-case analysis is performed numerically via a standard solver.
    Entropic Optimal Transport in Random Graphs. (arXiv:2201.03949v1 [stat.ML])
    (2 min) In graph analysis, a classic task consists in computing similarity measures between (groups of) nodes. In latent space random graphs, nodes are associated to unknown latent variables. One may then seek to compute distances directly in the latent space, using only the graph structure. In this paper, we show that it is possible to consistently estimate entropic-regularized Optimal Transport (OT) distances between groups of nodes in the latent space. We provide a general stability result for entropic OT with respect to perturbations of the cost matrix. We then apply it to several examples of random graphs, such as graphons or $\epsilon$-graphs on manifolds. Along the way, we prove new concentration results for the so-called Universal Singular Value Thresholding estimator, and for the estimation of geodesic distances on a manifold.
    Partial Model Averaging in Federated Learning: Performance Guarantees and Benefits. (arXiv:2201.03789v1 [cs.LG])
    (2 min) Local Stochastic Gradient Descent (SGD) with periodic model averaging (FedAvg) is a foundational algorithm in Federated Learning. The algorithm independently runs SGD on multiple workers and periodically averages the model across all the workers. When local SGD runs with many workers, however, the periodic averaging causes a significant model discrepancy across the workers making the global loss converge slowly. While recent advanced optimization methods tackle the issue focused on non-IID settings, there still exists the model discrepancy issue due to the underlying periodic model averaging. We propose a partial model averaging framework that mitigates the model discrepancy issue in Federated Learning. The partial averaging encourages the local models to stay close to each other on parameter space, and it enables to more effectively minimize the global loss. Given a fixed number of iterations and a large number of workers (128), the partial averaging achieves up to 2.2% higher validation accuracy than the periodic full averaging.
    Classification of Beer Bottles using Object Detection and Transfer Learning. (arXiv:2201.03791v1 [cs.CV])
    (2 min) Classification problems are common in Computer Vision. Despite this, there is no dedicated work for the classification of beer bottles. As part of the challenge of the master course Deep Learning, a dataset of 5207 beer bottle images and brand labels was created. An image contains exactly one beer bottle. In this paper we present a deep learning model which classifies pictures of beer bottles in a two step approach. As the first step, a Faster-R-CNN detects image sections relevant for classification independently of the brand. In the second step, the relevant image sections are classified by a ResNet-18. The image section with the highest confidence is returned as class label. We propose a model, with which we surpass the classic one step transfer learning approach and reached an accuracy of 99.86% during the challenge on the final test dataset. We were able to achieve 100% accuracy after the challenge ended
    Turkish Sentiment Analysis Using Machine Learning Methods: Application on Online Food Order Site Reviews. (arXiv:2201.03848v1 [cs.CL])
    (2 min) Satisfaction measurement, which emerges in every sector today, is a very important factor for many companies. In this study, it is aimed to reach the highest accuracy rate with various machine learning algorithms by using the data on Yemek Sepeti and variations of this data. The accuracy values of each algorithm were calculated together with the various natural language processing methods used. While calculating these accuracy values, the parameters of the algorithms used were tried to be optimized. The models trained in this study on labeled data can be used on unlabeled data and can give companies an idea in measuring customer satisfaction. It was observed that 3 different natural language processing methods applied resulted in approximately 5% accuracy increase in most of the developed models.
    Competing Mutual Information Constraints with Stochastic Competition-based Activations for Learning Diversified Representations. (arXiv:2201.03624v1 [cs.LG])
    (2 min) This work aims to address the long-established problem of learning diversified representations. To this end, we combine information-theoretic arguments with stochastic competition-based activations, namely Stochastic Local Winner-Takes-All (LWTA) units. In this context, we ditch the conventional deep architectures commonly used in Representation Learning, that rely on non-linear activations; instead, we replace them with sets of locally and stochastically competing linear units. In this setting, each network layer yields sparse outputs, determined by the outcome of the competition between units that are organized into blocks of competitors. We adopt stochastic arguments for the competition mechanism, which perform posterior sampling to determine the winner of each block. We further endow the considered networks with the ability to infer the sub-part of the network that is essential for modeling the data at hand; we impose appropriate stick-breaking priors to this end. To further enrich the information of the emerging representations, we resort to information-theoretic principles, namely the Information Competing Process (ICP). Then, all the components are tied together under the stochastic Variational Bayes framework for inference. We perform a thorough experimental investigation for our approach using benchmark datasets on image classification. As we experimentally show, the resulting networks yield significant discriminative representation learning abilities. In addition, the introduced paradigm allows for a principled investigation mechanism of the emerging intermediate network representations.
    Multi-granularity Relabeled Under-sampling Algorithm for Imbalanced Data. (arXiv:2201.03957v1 [cs.LG])
    (2 min) The imbalanced classification problem turns out to be one of the important and challenging problems in data mining and machine learning. The performances of traditional classifiers will be severely affected by many data problems, such as class imbalanced problem, class overlap and noise. The Tomek-Link algorithm was only used to clean data when it was proposed. In recent years, there have been reports of combining Tomek-Link algorithm with sampling technique. The Tomek-Link sampling algorithm can effectively reduce the class overlap on data, remove the majority instances that are difficult to distinguish, and improve the algorithm classification accuracy. However, the Tomek-Links under-sampling algorithm only considers the boundary instances that are the nearest neighbors to each other globally and ignores the potential local overlapping instances. When the number of minority instances is small, the under-sampling effect is not satisfactory, and the performance improvement of the classification model is not obvious. Therefore, on the basis of Tomek-Link, a multi-granularity relabeled under-sampling algorithm (MGRU) is proposed. This algorithm fully considers the local information of the data set in the local granularity subspace, and detects the local potential overlapping instances in the data set. Then, the overlapped majority instances are eliminated according to the global relabeled index value, which effectively expands the detection range of Tomek-Links. The simulation results show that when we select the optimal global relabeled index value for under-sampling, the classification accuracy and generalization performance of the proposed under-sampling algorithm are significantly better than other baseline algorithms.
    Captcha Attack:Turning Captchas Against Humanity. (arXiv:2201.04014v1 [cs.CR])
    (2 min) Nowadays, people generate and share massive content on online platforms (e.g., social networks, blogs). In 2021, the 1.9 billion daily active Facebook users posted around 150 thousand photos every minute. Content moderators constantly monitor these online platforms to prevent the spreading of inappropriate content (e.g., hate speech, nudity images). Based on deep learning (DL) advances, Automatic Content Moderators (ACM) help human moderators handle high data volume. Despite their advantages, attackers can exploit weaknesses of DL components (e.g., preprocessing, model) to affect their performance. Therefore, an attacker can leverage such techniques to spread inappropriate content by evading ACM. In this work, we propose CAPtcha Attack (CAPA), an adversarial technique that allows users to spread inappropriate text online by evading ACM controls. CAPA, by generating custom textual CAPTCHAs, exploits ACM's careless design implementations and internal procedures vulnerabilities. We test our attack on real-world ACM, and the results confirm the ferocity of our simple yet effective attack, reaching up to a 100% evasion success in most cases. At the same time, we demonstrate the difficulties in designing CAPA mitigations, opening new challenges in CAPTCHAs research area.
    A multi-scale sampling method for accurate and robust deep neural network to predict combustion chemical kinetics. (arXiv:2201.03549v1 [physics.chem-ph])
    (2 min) Machine learning has long been considered as a black box for predicting combustion chemical kinetics due to the extremely large number of parameters and the lack of evaluation standards and reproducibility. The current work aims to understand two basic questions regarding the deep neural network (DNN) method: what data the DNN needs and how general the DNN method can be. Sampling and preprocessing determine the DNN training dataset, further affect DNN prediction ability. The current work proposes using Box-Cox transformation (BCT) to preprocess the combustion data. In addition, this work compares different sampling methods with or without preprocessing, including the Monte Carlo method, manifold sampling, generative neural network method (cycle-GAN), and newly-proposed multi-scale sampling. Our results reveal that the DNN trained by the manifold data can capture the chemical kinetics in limited configurations but cannot remain robust toward perturbation, which is inevitable for the DNN coupled with the flow field. The Monte Carlo and cycle-GAN samplings can cover a wider phase space but fail to capture small-scale intermediate species, producing poor prediction results. A three-hidden-layer DNN, based on the multi-scale method without specific flame simulation data, allows predicting chemical kinetics in various scenarios and being stable during the temporal evolutions. This single DNN is readily implemented with several CFD codes and validated in various combustors, including (1). zero-dimensional autoignition, (2). one-dimensional freely propagating flame, (3). two-dimensional jet flame with triple-flame structure, and (4). three-dimensional turbulent lifted flames. The results demonstrate the satisfying accuracy and generalization ability of the pre-trained DNN. The Fortran and Python versions of DNN and example code are attached in the supplementary for reproducibility.
    Learning Fair Node Representations with Graph Counterfactual Fairness. (arXiv:2201.03662v1 [cs.LG])
    (2 min) Fair machine learning aims to mitigate the biases of model predictions against certain subpopulations regarding sensitive attributes such as race and gender. Among the many existing fairness notions, counterfactual fairness measures the model fairness from a causal perspective by comparing the predictions of each individual from the original data and the counterfactuals. In counterfactuals, the sensitive attribute values of this individual had been modified. Recently, a few works extend counterfactual fairness to graph data, but most of them neglect the following facts that can lead to biases: 1) the sensitive attributes of each node's neighbors may causally affect the prediction w.r.t. this node; 2) the sensitive attributes may causally affect other features and the graph structure. To tackle these issues, in this paper, we propose a novel fairness notion - graph counterfactual fairness, which considers the biases led by the above facts. To learn node representations towards graph counterfactual fairness, we propose a novel framework based on counterfactual data augmentation. In this framework, we generate counterfactuals corresponding to perturbations on each node's and their neighbors' sensitive attributes. Then we enforce fairness by minimizing the discrepancy between the representations learned from the original graph and the counterfactuals for each node. Experiments on both synthetic and real-world graphs show that our framework outperforms the state-of-the-art baselines in graph counterfactual fairness, and also achieves comparable prediction performance.
    Towards Group Robustness in the presence of Partial Group Labels. (arXiv:2201.03668v1 [cs.LG])
    (2 min) Learning invariant representations is an important requirement when training machine learning models that are driven by spurious correlations in the datasets. These spurious correlations, between input samples and the target labels, wrongly direct the neural network predictions resulting in poor performance on certain groups, especially the minority groups. Robust training against these spurious correlations requires the knowledge of group membership for every sample. Such a requirement is impractical in situations where the data labeling efforts for minority or rare groups are significantly laborious or where the individuals comprising the dataset choose to conceal sensitive information. On the other hand, the presence of such data collection efforts results in datasets that contain partially labeled group information. Recent works have tackled the fully unsupervised scenario where no labels for groups are available. Thus, we aim to fill the missing gap in the literature by tackling a more realistic setting that can leverage partially available sensitive or group information during training. First, we construct a constraint set and derive a high probability bound for the group assignment to belong to the set. Second, we propose an algorithm that optimizes for the worst-off group assignments from the constraint set. Through experiments on image and tabular datasets, we show improvements in the minority group's performance while preserving overall aggregate accuracy across groups.
    SpectraNet: Learned Recognition of Artificial Satellites From High Contrast Spectroscopic Imagery. (arXiv:2201.03614v1 [cs.LG])
    (2 min) Effective space traffic management requires positive identification of artificial satellites. Current methods for extracting object identification from observed data require spatially resolved imagery which limits identification to objects in low earth orbits. Most artificial satellites, however, operate in geostationary orbits at distances which prohibit ground based observatories from resolving spatial information. This paper demonstrates an object identification solution leveraging modified residual convolutional neural networks to map distance-invariant spectroscopic data to object identity. We report classification accuracies exceeding 80% for a simulated 64-class satellite problem--even in the case of satellites undergoing constant, random re-orientation. An astronomical observing campaign driven by these results returned accuracies of 72% for a nine-class problem with an average of 100 examples per class, performing as expected from simulation. We demonstrate the application of variational Bayesian inference by dropout, stochastic weight averaging (SWA), and SWA-focused deep ensembling to measure classification uncertainties--critical components in space traffic management where routine decisions risk expensive space assets and carry geopolitical consequences.
    Learning what to remember. (arXiv:2201.03806v1 [cs.LG])
    (2 min) We consider a lifelong learning scenario in which a learner faces a neverending and arbitrary stream of facts and has to decide which ones to retain in its limited memory. We introduce a mathematical model based on the online learning framework, in which the learner measures itself against a collection of experts that are also memory-constrained and that reflect different policies for what to remember. Interspersed with the stream of facts are occasional questions, and on each of these the learner incurs a loss if it has not remembered the corresponding fact. Its goal is to do almost as well as the best expert in hindsight, while using roughly the same amount of memory. We identify difficulties with using the multiplicative weights update algorithm in this memory-constrained scenario, and design an alternative scheme whose regret guarantees are close to the best possible.
    Reproducing BowNet: Learning Representations by Predicting Bags of Visual Words. (arXiv:2201.03556v1 [cs.CV])
    (2 min) This work aims to reproduce results from the CVPR 2020 paper by Gidaris et al. Self-supervised learning (SSL) is used to learn feature representations of an image using an unlabeled dataset. This work proposes to use bag-of-words (BoW) deep feature descriptors as a self-supervised learning target to learn robust, deep representations. BowNet is trained to reconstruct the histogram of visual words (ie. the deep BoW descriptor) of a reference image when presented a perturbed version of the image as input. Thus, this method aims to learn perturbation-invariant and context-aware image features that can be useful for few-shot tasks or supervised downstream tasks. In the paper, the author describes BowNet as a network consisting of a convolutional feature extractor $\Phi(\cdot)$ and a Dense-softmax layer $\Omega(\cdot)$ trained to predict BoW features from images. After BoW training, the features of $\Phi$ are used in downstream tasks. For this challenge we were trying to build and train a network that could reproduce the CIFAR-100 accuracy improvements reported in the original paper. However, we were unsuccessful in reproducing an accuracy improvement comparable to what the authors mentioned.
    Dictionary Learning with Uniform Sparse Representations for Anomaly Detection. (arXiv:2201.03869v1 [cs.LG])
    (2 min) Many applications like audio and image processing show that sparse representations are a powerful and efficient signal modeling technique. Finding an optimal dictionary that generates at the same time the sparsest representations of data and the smallest approximation error is a hard problem approached by dictionary learning (DL). We study how DL performs in detecting abnormal samples in a dataset of signals. In this paper we use a particular DL formulation that seeks uniform sparse representations model to detect the underlying subspace of the majority of samples in a dataset, using a K-SVD-type algorithm. Numerical simulations show that one can efficiently use this resulted subspace to discriminate the anomalies over the regular data points.
    Active Reinforcement Learning -- A Roadmap Towards Curious Classifier Systems for Self-Adaptation. (arXiv:2201.03947v1 [cs.LG])
    (2 min) Intelligent systems have the ability to improve their behaviour over time taking observations, experiences or explicit feedback into account. Traditional approaches separate the learning problem and make isolated use of techniques from different field of machine learning such as reinforcement learning, active learning, anomaly detection or transfer learning, for instance. In this context, the fundamental reinforcement learning approaches come with several drawbacks that hinder their application to real-world systems: trial-and-error, purely reactive behaviour or isolated problem handling. The idea of this article is to present a concept for alleviating these drawbacks by setting up a research agenda towards what we call "active reinforcement learning" in intelligent systems.
    Iterative RAKI with Complex-Valued Convolution for Improved Image Reconstruction with Limited Scan-Specific Training Samples. (arXiv:2201.03560v1 [eess.IV])
    (2 min) MRI scan time reduction is commonly achieved by Parallel Imaging methods, typically based on uniform undersampling of the inverse image space (a.k.a. k-space) and simultaneous signal reception with multiple receiver coils. The GRAPPA method interpolates missing k-space signals by linear combination of adjacent, acquired signals across all coils, and can be described by a convolution in k-space. Recently, a more generalized method called RAKI was introduced. RAKI is a deep-learning method that generalizes GRAPPA with additional convolution layers, on which a non-linear activation function is applied. This enables non-linear estimation of missing signals by convolutional neural networks. In analogy to GRAPPA, the convolution kernels in RAKI are trained using scan-specific training samples obtained from auto-calibration-signals (ACS). RAKI provides superior reconstruction quality compared to GRAPPA, however, often requires much more ACS due to its increased number of unknown parameters. In order to overcome this limitation, this study investigates the influence of training data on the reconstruction quality for standard 2D imaging, with particular focus on its amount and contrast information. Furthermore, an iterative k-space interpolation approach (iRAKI) is evaluated, which includes training data augmentation via an initial GRAPPA reconstruction, and refinement of convolution filters by iterative training. Using only 18, 20 and 25 ACS lines (8%), iRAKI outperforms RAKI by suppressing residual artefacts occurring at accelerations factors R=4 and R=5, and yields strong noise suppression in comparison to GRAPPA, underlined by quantitative quality metrics. Combination with a phase-constraint yields further improvement. Additionally, iRAKI shows better performance than GRAPPA and RAKI in case of pre-scan calibration and strongly varying contrast between training- and undersampled data.
    Evaluating Bayesian Model Visualisations. (arXiv:2201.03604v1 [cs.HC])
    (2 min) Probabilistic models inform an increasingly broad range of business and policy decisions ultimately made by people. Recent algorithmic, computational, and software framework development progress facilitate the proliferation of Bayesian probabilistic models, which characterise unobserved parameters by their joint distribution instead of point estimates. While they can empower decision makers to explore complex queries and to perform what-if-style conditioning in theory, suitable visualisations and interactive tools are needed to maximise users' comprehension and rational decision making under uncertainty. In this paper, propose a protocol for quantitative evaluation of Bayesian model visualisations and introduce a software framework implementing this protocol to support standardisation in evaluation practice and facilitate reproducibility. We illustrate the evaluation and analysis workflow on a user study that explores whether making Boxplots and Hypothetical Outcome Plots interactive can increase comprehension or rationality and conclude with design guidelines for researchers looking to conduct similar studies in the future.
    Path differentiability of ODE flows. (arXiv:2201.03819v1 [cs.LG])
    (2 min) We consider flows of ordinary differential equations (ODEs) driven by path differentiable vector fields. Path differentiable functions constitute a proper subclass of Lipschitz functions which admit conservative gradients, a notion of generalized derivative compatible with basic calculus rules. Our main result states that such flows inherit the path differentiability property of the driving vector field. We show indeed that forward propagation of derivatives given by the sensitivity differential inclusions provide a conservative Jacobian for the flow. This allows to propose a nonsmooth version of the adjoint method, which can be applied to integral costs under an ODE constraint. This result constitutes a theoretical ground to the application of small step first order methods to solve a broad class of nonsmooth optimization problems with parametrized ODE constraints. This is illustrated with the convergence of small step first order methods based on the proposed nonsmooth adjoint.
    Machine learning enabling high-throughput and remote operations at large-scale user facilities. (arXiv:2201.03550v1 [cs.LG])
    (2 min) Imaging, scattering, and spectroscopy are fundamental in understanding and discovering new functional materials. Contemporary innovations in automation and experimental techniques have led to these measurements being performed much faster and with higher resolution, thus producing vast amounts of data for analysis. These innovations are particularly pronounced at user facilities and synchrotron light sources. Machine learning (ML) methods are regularly developed to process and interpret large datasets in real-time with measurements. However, there remain conceptual barriers to entry for the facility general user community, whom often lack expertise in ML, and technical barriers for deploying ML models. Herein, we demonstrate a variety of archetypal ML models for on-the-fly analysis at multiple beamlines at the National Synchrotron Light Source II (NSLS-II). We describe these examples instructively, with a focus on integrating the models into existing experimental workflows, such that the reader can easily include their own ML techniques into experiments at NSLS-II or facilities with a common infrastructure. The framework presented here shows how with little effort, diverse ML models operate in conjunction with feedback loops via integration into the existing Bluesky Suite for experimental orchestration and data management.
    A Physics-Informed Vector Quantized Autoencoder for Data Compression of Turbulent Flow. (arXiv:2201.03617v1 [physics.flu-dyn])
    (2 min) Analyzing large-scale data from simulations of turbulent flows is memory intensive, requiring significant resources. This major challenge highlights the need for data compression techniques. In this study, we apply a physics-informed Deep Learning technique based on vector quantization to generate a discrete, low-dimensional representation of data from simulations of three-dimensional turbulent flows. The deep learning framework is composed of convolutional layers and incorporates physical constraints on the flow, such as preserving incompressibility and global statistical characteristics of the velocity gradients. The accuracy of the model is assessed using statistical, comparison-based similarity and physics-based metrics. The training data set is produced from Direct Numerical Simulation of an incompressible, statistically stationary, isotropic turbulent flow. The performance of this lossy data compression scheme is evaluated not only with unseen data from the stationary, isotropic turbulent flow, but also with data from decaying isotropic turbulence, and a Taylor-Green vortex flow. Defining the compression ratio (CR) as the ratio of original data size to the compressed one, the results show that our model based on vector quantization can offer CR $=85$ with a mean square error (MSE) of $O(10^{-3})$, and predictions that faithfully reproduce the statistics of the flow, except at the very smallest scales where there is some loss. Compared to the recent study based on a conventional autoencoder where compression is performed in a continuous space, our model improves the CR by more than $30$ percent, and reduces the MSE by an order of magnitude. Our compression model is an attractive solution for situations where fast, high quality and low-overhead encoding and decoding of large data are required.
    Stratified Graph Spectra. (arXiv:2201.03696v1 [cs.LG])
    (2 min) In classic graph signal processing, given a real-valued graph signal, its graph Fourier transform is typically defined as the series of inner products between the signal and each eigenvector of the graph Laplacian. Unfortunately, this definition is not mathematically valid in the cases of vector-valued graph signals which however are typical operands in the state-of-the-art graph learning modeling and analyses. Seeking a generalized transformation decoding the magnitudes of eigencomponents from vector-valued signals is thus the main objective of this paper. Several attempts are explored, and also it is found that performing the transformation at hierarchical levels of adjacency help profile the spectral characteristics of signals more insightfully. The proposed methods are introduced as a new tool assisting on diagnosing and profiling behaviors of graph learning models.
    NFANet: A Novel Method for Weakly Supervised Water Extraction from High-Resolution Remote Sensing Imagery. (arXiv:2201.03686v1 [cs.CV])
    (2 min) The use of deep learning for water extraction requires precise pixel-level labels. However, it is very difficult to label high-resolution remote sensing images at the pixel level. Therefore, we study how to utilize point labels to extract water bodies and propose a novel method called the neighbor feature aggregation network (NFANet). Compared with pixellevel labels, point labels are much easier to obtain, but they will lose much information. In this paper, we take advantage of the similarity between the adjacent pixels of a local water-body, and propose a neighbor sampler to resample remote sensing images. Then, the sampled images are sent to the network for feature aggregation. In addition, we use an improved recursive training algorithm to further improve the extraction accuracy, making the water boundary more natural. Furthermore, our method utilizes neighboring features instead of global or local features to learn more representative features. The experimental results show that the proposed NFANet method not only outperforms other studied weakly supervised approaches, but also obtains similar results as the state-of-the-art ones.
    An Accelerator for Rule Induction in Fuzzy Rough Theory. (arXiv:2201.03649v1 [cs.AI])
    (2 min) Rule-based classifier, that extract a subset of induced rules to efficiently learn/mine while preserving the discernibility information, plays a crucial role in human-explainable artificial intelligence. However, in this era of big data, rule induction on the whole datasets is computationally intensive. So far, to the best of our knowledge, no known method focusing on accelerating rule induction has been reported. This is first study to consider the acceleration technique to reduce the scale of computation in rule induction. We propose an accelerator for rule induction based on fuzzy rough theory; the accelerator can avoid redundant computation and accelerate the building of a rule classifier. First, a rule induction method based on consistence degree, called Consistence-based Value Reduction (CVR), is proposed and used as basis to accelerate. Second, we introduce a compacted search space termed Key Set, which only contains the key instances required to update the induced rule, to conduct value reduction. The monotonicity of Key Set ensures the feasibility of our accelerator. Third, a rule-induction accelerator is designed based on Key Set, and it is theoretically guaranteed to display the same results as the unaccelerated version. Specifically, the rank preservation property of Key Set ensures consistency between the rule induction achieved by the accelerator and the unaccelerated method. Finally, extensive experiments demonstrate that the proposed accelerator can perform remarkably faster than the unaccelerated rule-based classifier methods, especially on datasets with numerous instances.
    Demonstrating The Risk of Imbalanced Datasets in Chest X-ray Image-based Diagnostics by Prototypical Relevance Propagation. (arXiv:2201.03559v1 [eess.IV])
    (2 min) The recent trend of integrating multi-source Chest X-Ray datasets to improve automated diagnostics raises concerns that models learn to exploit source-specific correlations to improve performance by recognizing the source domain of an image rather than the medical pathology. We hypothesize that this effect is enforced by and leverages label-imbalance across the source domains, i.e, prevalence of a disease corresponding to a source. Therefore, in this work, we perform a thorough study of the effect of label-imbalance in multi-source training for the task of pneumonia detection on the widely used ChestX-ray14 and CheXpert datasets. The results highlight and stress the importance of using more faithful and transparent self-explaining models for automated diagnosis, thus enabling the inherent detection of spurious learning. They further illustrate that this undesirable effect of learning spurious correlations can be reduced considerably when ensuring label-balanced source domain datasets.
    Pavlovian Signalling with General Value Functions in Agent-Agent Temporal Decision Making. (arXiv:2201.03709v1 [cs.AI])
    (2 min) In this paper, we contribute a multi-faceted study into Pavlovian signalling -- a process by which learned, temporally extended predictions made by one agent inform decision-making by another agent. Signalling is intimately connected to time and timing. In service of generating and receiving signals, humans and other animals are known to represent time, determine time since past events, predict the time until a future stimulus, and both recognize and generate patterns that unfold in time. We investigate how different temporal processes impact coordination and signalling between learning agents by introducing a partially observable decision-making domain we call the Frost Hollow. In this domain, a prediction learning agent and a reinforcement learning agent are coupled into a two-part decision-making system that works to acquire sparse reward while avoiding time-conditional hazards. We evaluate two domain variations: machine agents interacting in a seven-state linear walk, and human-machine interaction in a virtual-reality environment. Our results showcase the speed of learning for Pavlovian signalling, the impact that different temporal representations do (and do not) have on agent-agent coordination, and how temporal aliasing impacts agent-agent and human-agent interactions differently. As a main contribution, we establish Pavlovian signalling as a natural bridge between fixed signalling paradigms and fully adaptive communication learning between two agents. We further show how to computationally build this adaptive signalling process out of a fixed signalling process, characterized by fast continual prediction learning and minimal constraints on the nature of the agent receiving signals. Our results therefore suggest an actionable, constructivist path towards communication learning between reinforcement learning agents.
  • stat.ML updates on arXiv.org

    Meta-Learning Reliable Priors in the Function Space. (arXiv:2106.03195v2 [cs.LG] UPDATED)
    (2 min) When data are scarce meta-learning can improve a learner's accuracy by harnessing previous experience from related learning tasks. However, existing methods have unreliable uncertainty estimates which are often overconfident. Addressing these shortcomings, we introduce a novel meta-learning framework, called F-PACOH, that treats meta-learned priors as stochastic processes and performs meta-level regularization directly in the function space. This allows us to directly steer the probabilistic predictions of the meta-learner towards high epistemic uncertainty in regions of insufficient meta-training data and, thus, obtain well-calibrated uncertainty estimates. Finally, we showcase how our approach can be integrated with sequential decision making, where reliable uncertainty quantification is imperative. In our benchmark study on meta-learning for Bayesian Optimization (BO), F-PACOH significantly outperforms all other meta-learners and standard baselines.
    GemNet: Universal Directional Graph Neural Networks for Molecules. (arXiv:2106.08903v6 [physics.comp-ph] UPDATED)
    (2 min) Effectively predicting molecular interactions has the potential to accelerate molecular dynamics by multiple orders of magnitude and thus revolutionize chemical simulations. Graph neural networks (GNNs) have recently shown great successes for this task, overtaking classical methods based on fixed molecular kernels. However, they still appear very limited from a theoretical perspective, since regular GNNs cannot distinguish certain types of graphs. In this work we close this gap between theory and practice. We show that GNNs with directed edge embeddings and two-hop message passing are indeed universal approximators for predictions that are invariant to translation, and equivariant to permutation and rotation. We then leverage these insights and multiple structural improvements to propose the geometric message passing neural network (GemNet). We demonstrate the benefits of the proposed changes in multiple ablation studies. GemNet outperforms previous models on the COLL, MD17, and OC20 datasets by 34%, 41%, and 20%, respectively, and performs especially well on the most challenging molecules. Our implementation is available online.
    Robust Generalised Bayesian Inference for Intractable Likelihoods. (arXiv:2104.07359v3 [stat.ME] UPDATED)
    (2 min) Generalised Bayesian inference updates prior beliefs using a loss function, rather than a likelihood, and can therefore be used to confer robustness against possible mis-specification of the likelihood. Here we consider generalised Bayesian inference with a Stein discrepancy as a loss function, motivated by applications in which the likelihood contains an intractable normalisation constant. In this context, the Stein discrepancy circumvents evaluation of the normalisation constant and produces generalised posteriors that are either closed form or accessible using standard Markov chain Monte Carlo. On a theoretical level, we show consistency, asymptotic normality, and bias-robustness of the generalised posterior, highlighting how these properties are impacted by the choice of Stein discrepancy. Then, we provide numerical experiments on a range of intractable distributions, including applications to kernel-based exponential family models and non-Gaussian graphical models.
    Bayesian imaging using Plug & Play priors: when Langevin meets Tweedie. (arXiv:2103.04715v5 [stat.ME] UPDATED)
    (3 min) Since the seminal work of Venkatakrishnan et al. (2013), Plug & Play (PnP) methods have become ubiquitous in Bayesian imaging. These methods derive Minimum Mean Square Error (MMSE) or Maximum A Posteriori (MAP) estimators for inverse problems in imaging by combining an explicit likelihood function with a prior that is implicitly defined by an image denoising algorithm. The PnP algorithms proposed in the literature mainly differ in the iterative schemes they use for optimisation or for sampling. In the case of optimisation schemes, some recent works guarantee the convergence to a fixed point, albeit not necessarily a MAP estimate. In the case of sampling schemes, to the best of our knowledge, there is no known proof of convergence. There also remain important open questions regarding whether the underlying Bayesian models and estimators are well defined, well-posed, and have the basic regularity properties required to support these numerical schemes. To address these limitations, this paper develops theory, methods, and provably convergent algorithms for performing Bayesian inference with PnP priors. We introduce two algorithms: 1) PnP-ULA (Unadjusted Langevin Algorithm) for Monte Carlo sampling and MMSE inference; and 2) PnP-SGD (Stochastic Gradient Descent) for MAP inference. Using recent results on the quantitative convergence of Markov chains, we establish detailed convergence guarantees for these two algorithms under realistic assumptions on the denoising operators used, with special attention to denoisers based on deep neural networks. We also show that these algorithms approximately target a decision-theoretically optimal Bayesian model that is well-posed. The proposed algorithms are demonstrated on several canonical problems such as image deblurring, inpainting, and denoising, where they are used for point estimation as well as for uncertainty visualisation and quantification.
    Optimal Thinning of MCMC Output. (arXiv:2005.03952v5 [stat.ME] UPDATED)
    (2 min) The use of heuristics to assess the convergence and compress the output of Markov chain Monte Carlo can be sub-optimal in terms of the empirical approximations that are produced. Typically a number of the initial states are attributed to "burn in" and removed, whilst the remainder of the chain is "thinned" if compression is also required. In this paper we consider the problem of retrospectively selecting a subset of states, of fixed cardinality, from the sample path such that the approximation provided by their empirical distribution is close to optimal. A novel method is proposed, based on greedy minimisation of a kernel Stein discrepancy, that is suitable for problems where heavy compression is required. Theoretical results guarantee consistency of the method and its effectiveness is demonstrated in the challenging context of parameter inference for ordinary differential equations. Software is available in the Stein Thinning package in Python, R and MATLAB.
    Representation Ensembling for Synergistic Lifelong Learning with Quasilinear Complexity. (arXiv:2004.12908v12 [cs.AI] UPDATED)
    (3 min) In biological learning, data are used to improve performance not only on the current task, but also on previously encountered, and as yet unencountered tasks. In contrast, classical machine learning starts from a blank slate, or tabula rasa, using data only for the single task at hand. While typical transfer learning algorithms can improve performance on future tasks, their performance on prior tasks degrades upon learning new tasks (called catastrophic forgetting). Many recent approaches for continual or lifelong learning have attempted to maintain performance given new tasks. But striving to avoid forgetting sets the goal unnecessarily low: the goal of lifelong learning, whether biological or artificial, should be to improve performance on both past and future tasks with any new data. Our key insight is that we can ensemble representations learned independently across tasks to transfer omnidirectionally, that is, jointly improve performance on both future unforeseen tasks (forward transfer) and past tasks (backward transfer). Specifically, we propose two lifelong learners: one ensembling trees and the other ensembling networks. In both cases, the algorithms are able to learn synergistically, improving performance on both past and future tasks in a variety of simulated and real data scenarios, including tabular data, image data, spoken data, and adversarial tasks. Moreover, they can do so with quasilinear space and time complexity, a requirement for any bona fide lifelong learning system.
    Learning what to remember. (arXiv:2201.03806v1 [cs.LG])
    (2 min) We consider a lifelong learning scenario in which a learner faces a neverending and arbitrary stream of facts and has to decide which ones to retain in its limited memory. We introduce a mathematical model based on the online learning framework, in which the learner measures itself against a collection of experts that are also memory-constrained and that reflect different policies for what to remember. Interspersed with the stream of facts are occasional questions, and on each of these the learner incurs a loss if it has not remembered the corresponding fact. Its goal is to do almost as well as the best expert in hindsight, while using roughly the same amount of memory. We identify difficulties with using the multiplicative weights update algorithm in this memory-constrained scenario, and design an alternative scheme whose regret guarantees are close to the best possible.
    Do We Exploit all Information for Counterfactual Analysis? Benefits of Factor Models and Idiosyncratic Correction. (arXiv:2011.03996v3 [econ.EM] UPDATED)
    (2 min) Optimal pricing, i.e., determining the price level that maximizes profit or revenue of a given product, is a vital task for the retail industry. To select such a quantity, one needs first to estimate the price elasticity from the product demand. Regression methods usually fail to recover such elasticities due to confounding effects and price endogeneity. Therefore, randomized experiments are typically required. However, elasticities can be highly heterogeneous depending on the location of stores, for example. As the randomization frequently occurs at the municipal level, standard difference-in-differences methods may also fail. Possible solutions are based on methodologies to measure the effects of treatments on a single (or just a few) treated unit(s) based on counterfactuals constructed from artificial controls. For example, for each city in the treatment group, a counterfactual may be constructed from the untreated locations. In this paper, we apply a novel high-dimensional statistical method to measure the effects of price changes on daily sales from a major retailer in Brazil. The proposed methodology combines principal components (factors) and sparse regressions, resulting in a method called Factor-Adjusted Regularized Method for Treatment evaluation (\texttt{FarmTreat}). The data consist of daily sales and prices of five different products over more than 400 municipalities. The products considered belong to the \emph{sweet and candies} category and experiments have been conducted over the years of 2016 and 2017. Our results confirm the hypothesis of a high degree of heterogeneity yielding very different pricing strategies over distinct municipalities.
    The Ridgelet Prior: A Covariance Function Approach to Prior Specification for Bayesian Neural Networks. (arXiv:2010.08488v4 [stat.ML] UPDATED)
    (2 min) Bayesian neural networks attempt to combine the strong predictive performance of neural networks with formal quantification of uncertainty associated with the predictive output in the Bayesian framework. However, it remains unclear how to endow the parameters of the network with a prior distribution that is meaningful when lifted into the output space of the network. A possible solution is proposed that enables the user to posit an appropriate Gaussian process covariance function for the task at hand. Our approach constructs a prior distribution for the parameters of the network, called a ridgelet prior, that approximates the posited Gaussian process in the output space of the network. In contrast to existing work on the connection between neural networks and Gaussian processes, our analysis is non-asymptotic, with finite sample-size error bounds provided. This establishes the universality property that a Bayesian neural network can approximate any Gaussian process whose covariance function is sufficiently regular. Our experimental assessment is limited to a proof-of-concept, where we demonstrate that the ridgelet prior can out-perform an unstructured prior on regression problems for which a suitable Gaussian process prior can be provided.
    M2: Mixed Models with Preferences, Popularities and Transitions for Next-Basket Recommendation. (arXiv:2004.01646v3 [cs.LG] UPDATED)
    (2 min) Next-basket recommendation considers the problem of recommending a set of items into the next basket that users will purchase as a whole. In this paper, we develop a novel mixed model with preferences, popularities and transitions (M2) for the next-basket recommendation. This method models three important factors in next-basket generation process: 1) users' general preferences, 2) items' global popularities and 3) transition patterns among items. Unlike existing recurrent neural network-based approaches, M2 does not use the complicated networks to model the transitions among items, or generate embeddings for users. Instead, it has a simple encoder-decoder based approach (ed-Trans) to better model the transition patterns among items. We compared M2 with different combinations of the factors with 5 state-of-the-art next-basket recommendation methods on 4 public benchmark datasets in recommending the first, second and third next basket. Our experimental results demonstrate that M2 significantly outperforms the state-of-the-art methods on all the datasets in all the tasks, with an improvement of up to 22.1%. In addition, our ablation study demonstrates that the ed-Trans is more effective than recurrent neural networks in terms of the recommendation performance. We also have a thorough discussion on various experimental protocols and evaluation metrics for next-basket recommendation evaluation.
    A general class of surrogate functions for stable and efficient reinforcement learning. (arXiv:2108.05828v2 [cs.LG] UPDATED)
    (2 min) Common policy gradient methods rely on the maximization of a sequence of surrogate functions. In recent years, many such surrogate functions have been proposed, most without strong theoretical guarantees, leading to algorithms such as TRPO, PPO or MPO. Rather than design yet another surrogate function, we instead propose a general framework (FMA-PG) based on functional mirror ascent that gives rise to an entire family of surrogate functions. We construct surrogate functions that enable policy improvement guarantees, a property not shared by most existing surrogate functions. Crucially, these guarantees hold regardless of the choice of policy parameterization. Moreover, a particular instantiation of FMA-PG recovers important implementation heuristics (e.g., using forward vs reverse KL divergence) resulting in a variant of TRPO with additional desirable properties. Via experiments on simple bandit problems, we evaluate the algorithms instantiated by FMA-PG. The proposed framework also suggests an improved variant of PPO, whose robustness and efficiency we empirically demonstrate on the MuJoCo suite.
    Competing Mutual Information Constraints with Stochastic Competition-based Activations for Learning Diversified Representations. (arXiv:2201.03624v1 [cs.LG])
    (2 min) This work aims to address the long-established problem of learning diversified representations. To this end, we combine information-theoretic arguments with stochastic competition-based activations, namely Stochastic Local Winner-Takes-All (LWTA) units. In this context, we ditch the conventional deep architectures commonly used in Representation Learning, that rely on non-linear activations; instead, we replace them with sets of locally and stochastically competing linear units. In this setting, each network layer yields sparse outputs, determined by the outcome of the competition between units that are organized into blocks of competitors. We adopt stochastic arguments for the competition mechanism, which perform posterior sampling to determine the winner of each block. We further endow the considered networks with the ability to infer the sub-part of the network that is essential for modeling the data at hand; we impose appropriate stick-breaking priors to this end. To further enrich the information of the emerging representations, we resort to information-theoretic principles, namely the Information Competing Process (ICP). Then, all the components are tied together under the stochastic Variational Bayes framework for inference. We perform a thorough experimental investigation for our approach using benchmark datasets on image classification. As we experimentally show, the resulting networks yield significant discriminative representation learning abilities. In addition, the introduced paradigm allows for a principled investigation mechanism of the emerging intermediate network representations.
    Towards Group Robustness in the presence of Partial Group Labels. (arXiv:2201.03668v1 [cs.LG])
    (2 min) Learning invariant representations is an important requirement when training machine learning models that are driven by spurious correlations in the datasets. These spurious correlations, between input samples and the target labels, wrongly direct the neural network predictions resulting in poor performance on certain groups, especially the minority groups. Robust training against these spurious correlations requires the knowledge of group membership for every sample. Such a requirement is impractical in situations where the data labeling efforts for minority or rare groups are significantly laborious or where the individuals comprising the dataset choose to conceal sensitive information. On the other hand, the presence of such data collection efforts results in datasets that contain partially labeled group information. Recent works have tackled the fully unsupervised scenario where no labels for groups are available. Thus, we aim to fill the missing gap in the literature by tackling a more realistic setting that can leverage partially available sensitive or group information during training. First, we construct a constraint set and derive a high probability bound for the group assignment to belong to the set. Second, we propose an algorithm that optimizes for the worst-off group assignments from the constraint set. Through experiments on image and tabular datasets, we show improvements in the minority group's performance while preserving overall aggregate accuracy across groups.
    Entropic Optimal Transport in Random Graphs. (arXiv:2201.03949v1 [stat.ML])
    (2 min) In graph analysis, a classic task consists in computing similarity measures between (groups of) nodes. In latent space random graphs, nodes are associated to unknown latent variables. One may then seek to compute distances directly in the latent space, using only the graph structure. In this paper, we show that it is possible to consistently estimate entropic-regularized Optimal Transport (OT) distances between groups of nodes in the latent space. We provide a general stability result for entropic OT with respect to perturbations of the cost matrix. We then apply it to several examples of random graphs, such as graphons or $\epsilon$-graphs on manifolds. Along the way, we prove new concentration results for the so-called Universal Singular Value Thresholding estimator, and for the estimation of geodesic distances on a manifold.
    Partial Model Averaging in Federated Learning: Performance Guarantees and Benefits. (arXiv:2201.03789v1 [cs.LG])
    (2 min) Local Stochastic Gradient Descent (SGD) with periodic model averaging (FedAvg) is a foundational algorithm in Federated Learning. The algorithm independently runs SGD on multiple workers and periodically averages the model across all the workers. When local SGD runs with many workers, however, the periodic averaging causes a significant model discrepancy across the workers making the global loss converge slowly. While recent advanced optimization methods tackle the issue focused on non-IID settings, there still exists the model discrepancy issue due to the underlying periodic model averaging. We propose a partial model averaging framework that mitigates the model discrepancy issue in Federated Learning. The partial averaging encourages the local models to stay close to each other on parameter space, and it enables to more effectively minimize the global loss. Given a fixed number of iterations and a large number of workers (128), the partial averaging achieves up to 2.2% higher validation accuracy than the periodic full averaging.
    An Embedding Framework for Consistent Polyhedral Surrogates. (arXiv:1907.07330v2 [cs.LG] UPDATED)
    (2 min) We formalize and study the natural approach of designing convex surrogate loss functions via embeddings, for problems such as classification, ranking, or structured prediction. In this approach, one embeds each of the finitely many predictions (e.g.\ rankings) as a point in $\mathbb{R}^d$, assigns the original loss values to these points, and "convexifies" the loss in some way to obtain a surrogate. We establish a strong connection between this approach and polyhedral (piecewise-linear convex) surrogate losses. Given any polyhedral loss $L$, we give a construction of a link function through which $L$ is a consistent surrogate for the loss it embeds. Conversely, we show how to construct a consistent polyhedral surrogate for any given discrete loss. Our framework yields succinct proofs of consistency or inconsistency of various polyhedral surrogates in the literature, and for inconsistent surrogates, it further reveals the discrete losses for which these surrogates are consistent. We show some additional structure of embeddings, such as the equivalence of embedding and matching Bayes risks, and the equivalence of various notions of non-redudancy. Using these results, we establish that indirect elicitation, a necessary condition for consistency, is also sufficient when working with polyhedral surrogates.
  • Google AI Blog

    Scaling Vision with Sparse Mixture of Experts
    (8 min) Posted by Carlos Riquelme, Research Scientist and Joan Puigcerver, Software Engineer, Google Research, Brain team Advances in deep learning over the last few decades have been driven by a few key elements. With a small number of simple but flexible mechanisms (i.e., inductive biases such as convolutions or sequence attention), increasingly large datasets, and more specialized hardware, neural networks can now achieve impressive results on a wide range of tasks, such as image classification, machine translation, and protein folding prediction. However, the use of large models and datasets comes at the expense of significant computational requirements. Yet, recent works suggest that large model sizes might be necessary for strong generalization and robustness, so training large models whil…
  • Data Science Central

    Atom Computing Tech Perspectives: 2022 Quantum Computing Tech Predictions
    (6 min) Quantum computing gained a lot of attention and headlines in 2021 with companies, like ours, making big bets on this paradigm-shifting technology. While quantum computing may not be a household term just yet, we are going to see big advancements in 2022 on technology development and breadth of involvement. As I’m out talking to customers… Read More »Atom Computing Tech Perspectives: 2022 Quantum Computing Tech Predictions The post Atom Computing Tech Perspectives: 2022 Quantum Computing Tech Predictions appeared first on Data Science Central.
  • The Official NVIDIA Blog

    How Retailers Meet Tough Challenges Using NVIDIA AI
    (4 min) At the National Retail Federation’s annual trade show, conversations tend to touch on recurring themes: “Will we be able to stock must-have products for next Christmas?,” “What incentives can I offer to loyal workers?” and “What happens to my margins if Susie Consumer purchases three of the same dresses online and returns two?” The $26 Read article > The post How Retailers Meet Tough Challenges Using NVIDIA AI  appeared first on The Official NVIDIA Blog.
    AI Startup to Take a Bite Out of Fast-Food Labor Crunch
    (3 min) Addressing a growing labor crisis among quick-service restaurants, startup Vistry is harnessing AI to automate the process of taking orders. The company will share its story at the NRF Big Show, the annual industry gathering of the National Retail Federation in New York, starting Jan. 16. “They’re closing restaurants because there is not enough labor,” Read article > The post AI Startup to Take a Bite Out of Fast-Food Labor Crunch appeared first on The Official NVIDIA Blog.
    GFN Thursday: ‘Fortnite’ Comes to iOS Safari and Android Through NVIDIA GeForce NOW via Closed Beta
    (3 min) Starting next week, Fortnite on GeForce NOW will launch in a limited-time closed beta for mobile, all streamed through the Safari web browser on iOS and the GeForce NOW Android app. The beta is open for registration for all GeForce NOW members, and will help test our server capacity, graphics delivery and new touch controls Read article > The post GFN Thursday: ‘Fortnite’ Comes to iOS Safari and Android Through NVIDIA GeForce NOW via Closed Beta appeared first on The Official NVIDIA Blog.
  • Blog

    Two-Dimensional Tensors in Pytorch
    (8 min) Two-dimensional tensors are analogous to two-dimensional metrics. Like a two-dimensional metric, a two-dimensional tensor also has $n$ number of rows […] The post Two-Dimensional Tensors in Pytorch appeared first on Machine Learning Mastery.
  • Becoming Human: Artificial Intelligence Magazine - Medium

    Do stochastic parrots understand what they recite?
    (6 min) There had never been a time like this before in the history of Artificial Intelligence (AI) where AI was this democratic and celebrated…
    Jump start in object detection with Detectron2 (license plate detection)
    (10 min) A very practical guide to build your first object detection model, jump from rough idea to proof-of-concept in one day
  • John D. Cook

    Self-Orthogonal Latin Squares
    (3 min) The other day I wrote about orthogonal Latin squares. Two Latin squares are orthogonal if the list of pairs of corresponding elements in the two squares contains no repeats. A self-orthogonal Latin square (SOLS) is a Latin square that is orthogonal to its transpose. Here’s an example of a self-orthogonal Latin square: 1 7 6 […] Self-Orthogonal Latin Squares first appeared on John D. Cook.
    Better parts, worse system
    (2 min) There’s a rule of systems thinking that improving part of a system can often make the system as a whole worse. One example of this is Braess’ Paradox that adding roads can make traffic worse, and closing roads can improve traffic. Another example is the paradox of enrichment: Increasing the food available to an ecosystem […] Better parts, worse system first appeared on John D. Cook.
  • Artificial Intelligence

    Software for generating new voices; Not voice cloning.
    (1 min) There's plenty of software out there that can clone another person's voice. And there's even some software that acts as a voice changer, mapping another person's voice to your performance. What I would like is voice changing software that allows you to edit and save a fully custom voice. It could be used for personal animation and video game projects. ​ There is the concern of how this might effect the careers of people in the voice acting business. . . But I figure that making a professional acting performance would still be a valuable service. And either way, I find it might be equally questionable to restrict the advancement of technology to preserve anyone's monopoly. submitted by /u/BladeManEXE7 [link] [comments]
    I'm investigating the impact of the implementation of AI in the rail industry. I would appreciate if you could answer a simple survey.
    (1 min) Hello everyone, I would appreciate if you could answer this survey. It is anonymous and takes about 3 minutes. LINK: https://docs.google.com/forms/d/1HxsVggCMTXtKGP3ov9lK3C6gyS9faN9Q0EtwFYqEwqY Thank you! submitted by /u/Adventurous_Wall6596 [link] [comments]
    whats the best text i could record for a ai TTS bot.
    (0 min) i want to make a realist tts bot of my voice and i just need some good text to read out and record for my ai. i just dont know what should i record. any tips? submitted by /u/Born-Macaroon-7610 [link] [comments]
    discord ai stock bot with auto-moderation and headline analysis- [buni]x, the based universal natty instrument x =]
    (0 min) submitted by /u/ungKoda [link] [comments]
    Equations for computing true positives and false positives when using object detection algorithms?
    (1 min) I am running some evaluation metrics using the YOLOv5 object detection algorithm, and wish to calculate my true positives and false positives. For instance, the evaluation metric outputs are as follows: Class Images Labels Prec Recall mAP@.5 mAP@.5:.95: all 100 36 0.444 0.702 0.481 0.223 Class 1 50 29 0.588 0.689 0.668 0.333 Class 2 50 7 0.301 0.714 0.293 0.113 Looking at this source (https://github.com/ultralytics/yolov5/issues/5713), I found that you could calculate the true positives and false positives with the following equations: ``` Computed for Class 1 TP = Recall * Labels = 34.45 ≈ 34 FP = (TP / Precision) - TP = 23.82 ≈ 24 ``` I am new to evaluation metrics, so at first glance, I'm thinking that the false positive number is fairly high. Is this the correct formula to compute the true positives and false positives? I'm just looking for some verification and some explanation as to why this works, if it does. submitted by /u/EnvironmentalLemon36 [link] [comments]
    Top Emerging Deep Learning Trends For 2022
    (0 min) submitted by /u/techsucker [link] [comments]
    [R] Facebook AI & UC Berkeley’s ConvNeXts Compete Favourably With SOTA Hierarchical ViTs on CV Benchmarks
    (1 min) A team from Facebook AI Research and UC Berkeley proposes ConvNeXts, a pure ConvNet model that achieves performance comparable with state-of-the-art hierarchical vision transformers on computer vision benchmarks while retaining the simplicity and efficiency of standard ConvNets. Here is a quick read: Facebook AI & UC Berkeley’s ConvNeXts Compete Favourably With SOTA Hierarchical ViTs on CV Benchmarks. The ConvNeXt code is available on the project’s GitHub. The paper A ConvNet for the 2020s is on arXiv. submitted by /u/Yuqing7 [link] [comments]
    Top 10 Face Detection APIs
    (0 min) submitted by /u/tah_zem [link] [comments]
    AI did some good work for my prompt "Gothic Dream"
    (1 min) submitted by /u/snowpixelapp [link] [comments]
    The age of AI-ism
    (0 min) submitted by /u/bendee983 [link] [comments]
    Peril of Human-Like Artificial Intelligence
    (1 min) arXiv:2201.04200 Date: Tue, 11 Jan 2022 21:07:17 GMT (437kb) Title: The Turing Trap: The Promise & Peril of Human-Like Artificial Intelligence Authors: Erik Brynjolfsson Categories: econ.GN cs.AI cs.CY cs.LG q-fin.EC Comments: Forthcoming in Daedalus, April 2022. Posted with permission MSC-class: 91 ACM-class: K.4.0; K.4.2 \\ In 1950, Alan Turing proposed an imitation game as the ultimate test of whether a machine was intelligent: could a machine imitate a human so well that its answers to questions indistinguishable from a human. Ever since, creating intelligence that matches human intelligence has implicitly or explicitly been the goal of thousands of researchers, engineers, and entrepreneurs. The benefits of human-like artificial intelligence (HLAI) include soaring productivity, increased leisure, and perhaps most profoundly, a better understanding of our own minds. But not all types of AI are human-like. In fact, many of the most powerful systems are very different from humans. So an excessive focus on developing and deploying HLAI can lead us into a trap. As machines become better substitutes for human labor, workers lose economic and political bargaining power and become increasingly dependent on those who control the technology. In contrast, when AI is focused on augmenting humans rather than mimicking them, then humans retain the power to insist on a share of the value created. Furthermore, augmentation creates new capabilities and new products and services, ultimately generating far more value than merely human-like AI. While both types of AI can be enormously beneficial, there are currently excess incentives for automation rather than augmentation among technologists, business executives, and policymakers. \\ ( https://arxiv.org/abs/2201.04200 , 437kb) submitted by /u/kg4jxt [link] [comments]
    A curated list of Speech/NLU SaaS companies
    (1 min) Hi, I'm the maker of SpeechPro, a niche job board for Speech AI tech professionals. I've just added a page listing all the Speech/NLU SaaS companies. SpeechPro company list screenshot It's a growing list and I'll keep updating it. Right now there are 50 and most of them are ASR/TTS/NLU/CAI related. Considering that ASR and TTS are essential parts for the Human-Machine interface, Speech SaaS companies might gain more popularity in this #Metaverse trend. Hope this company list can be helpful. Thanks. submitted by /u/david_swagger [link] [comments]
    Solving CAPTCHAs With Machine Learning to Enable Dark Web Research
    (1 min) submitted by /u/DaveBowman1975 [link] [comments]
  • Machine Learning

    [N] If you are interested in graph representation learning with Graph Neural Networks, then read on
    (1 min) Hi r/machinelearning, For the past 1.5 years I have been organizing an online journal club on the topic of Graph Representation Learning. We meet either weekly or fortnightly via Zoom to discuss a relevant paper. We are a small and friendly group and we would like to invite others who have similar interests to join us. We meet on Thursdays, 6:00pm-7:30pm, Canada/Pacific timezone, and out next meeting is on January 20, 2022. You are welcome to join us here. Cheers! submitted by /u/YodaML [link] [comments]
    [P] Colab Notebook to create AI-Generated Pokémon in 2 clicks
    (1 min) https://colab.research.google.com/drive/1A3t2gQofQGeXo5z1BAr1zqYaqVg3czKd?usp=sharing Last month I did an experiment with AI-Generated Pokémon on a fine trained ruDALL-E that went unexpectedly viral. Today, I've open-sourced the finetuned model and released a Colab Notebook that lets you generate Pokemon with just two clicks. Also, I added an infinite generation feature which works surprisingly well! submitted by /u/minimaxir [link] [comments]
    Trouble Modelling Heavy-Tailed Distributions [Research]
    (2 min) I'm currently working a problem where I'm attempting to reconstruct 1x600 long Fourier vectors using an Autoencoder, where each element in the vector is a frequency (increasing sequentially). I have a set of 17,000 spectra which I apply Numpy's rfft to obtain the corresponding Fourier vector. After transforming all of my data, I use an sklearn StandardScaler() to Standardize each column (frequency). From here, I am training my NN on this Standardized data. I have created plots of two samples from my data for illustration purposes. ​ https://preview.redd.it/w357enqdhib81.png?width=1526&format=png&auto=webp&s=b34d65ada9d241614b22944b2648cc60845f801c ​ https://preview.redd.it/96d3i8cfhib81.png?width=1526&format=png&auto=webp&s=1f89c75902685cb500271b2235fc382e584e0d09 You can see that in …
    [D] Solo machine learning engineer woes
    (2 min) I work at a start-up as the sole machine learning engineer, and I have been tasked to develop a state-of-the-art model on a domain specific problem. Some information on the task: Computer vision problem - complex keypoint localisation / 3D mesh prediction No training data - I'm proposing that I build a synthetic data pipeline supplemented with real-world data. I feel like there is too much for me to be doing at once: reading papers, writing plans and documentation, implementing code and training models, reporting to management, etc. My boss understands that research takes a long time. Nevertheless, I feel like management does not fully grasp that results on this are not certain or even likely. Even though the work environment is good, I am not handling the uncertainty around results well mentally. I wanted to ask, in your experience, what is a typical team size for this kind of job? And do you have any advice? Thanks submitted by /u/Hgat [link] [comments]
    [D] Portals for outsourcing preliminary data labeling
    (1 min) I am currently working on building a machine learning model, which will be used to automatically label tweets with regards to some prespecified labels (a total of four). Together with a team we are looking for an online tool which we could use to outsource preliminary labeling for later training of the model. So far we have been using a program built specifically for this task by one of the co-researchers, but now we want to switch to something more mainstream. The task we want to outsource goes as follows: participants will be given the text of the tweet and they will have to label it according to prespecified labels. Simple as that. Since the tweets are in Polish we are not interested in any additional features, which would be specific to the English population. We are currently considering the two following platforms: https://prodi.gy/demo https://universaldatatool.com/app/ Did any of you take part in a similar study and/or have some other experience with using the above tools? Are there any other tools that could work for the task above that you could recommend? submitted by /u/Hub_Pli [link] [comments]
    [D] What has been your experience as a machine learning manager?
    (3 min) As the title says, I am curious to learn more about what this sub’s experience has been working with/as management. As context, the company I work for (series b startup) is asking me to take over as the machine learning team manager. While I’m flattered that they see me as a “rising star”, I am a bit concerned about moving away from the actual building, testing, and deployment of machine learning models. I have been an ML engineer for about a year and prior to that I worked as a DS at large tech firm for 2 years and have managed multiple teams prior to that as well. My questions for the sub are as follows: Does moving into a management role take me out of contention for more IC roles in the broader job market? Or do recruiters not care as much about that? I recognize that I will be doing less code development as manager, but just how much of a drop off should I expect? ML Managers of Reddit, how do you keep pace with the technical developments in the field without actually writing and submitting code? Thanks to everyone in advance! submitted by /u/frank_sobotka_ [link] [comments]
    [N] EasySynth - Unreal Engine plugin for easy creation of synthetic images (depth maps, optical flow, semantics, ...)
    (2 min) Hello Community! We needed a user-friendly image dataset creation tool for Machine Learning and Computer Vision purposes, but since all we could find were advanced simulators, we decided to create an open source one in case anybody else found it useful. EasySynth is an easy-to-use UnrealEngine plugin which enables simple generation of ground truth depth images, normal images, optical flow images and semantic images. EasySynth does not require knowledge of either C++ or Blueprints. It utilizes a LevelSequence (checkout the video using the link below) to define the movement of the camera and provides a simple interface for semantic labeling of actors present in the scene. It supports exporting camera positions and rotations at each frame, as well as the following output formats: Color images rendered by default Depth grayscale images Pixel normal images Optical flow between frames Images with actor semantic labels As an example, the following output can be created in 20 minutes, from the project setup to the output rendering (using a third-party level). Check out the workflow timelapse. Provides GT data for training depth estimation, normal estimation, semantic or optical flow models For more details check out the GitHub repo. We are working on making this plugin available for free in the Unreal Engine plugin marketplace. The current version works with UE 4.27. We hope somebody will find it useful! submitted by /u/Ok_Compote_3050 [link] [comments]
    [D] Is there value in adding comments to large ML projects?
    (1 min) So I've started working with yolov5 which has literally 0 comments. If I went and added comments to the code, would it Be appreciated Actually be accepted into the main branch If it would never be accepted I don't really want to spend the time tbh. submitted by /u/a_slay_nub [link] [comments]
    [D] How to do self-research?
    (3 min) Hai! I want to gain research experience in Deep Learning (preferably NLP). How should I go about doing self-research? Working as an RA is one way but most of the professors I've contacted are either not accepting people outside their universities or want to me do full-time which I cannot. I want to build my research profile which will be a valuable asset my grad school application also. Please help me figuring out ways to do research and build top notch projects. submitted by /u/ICantStopMe- [link] [comments]
    [P] Audio data augmentation techniques
    (1 min) Audio data augmentation is a great solution to make your audio AI models more robust. In my new video in the "Audio Data Augmentation" series, you can learn about augmentation techniques both in the raw audio and spectrogram domains. Specifically, you’ll learn about: 📌 Time shifting 📌 Time stretching 📌 Pitch scaling 📌 Noise addition 📌 Impulse response addition 📌 Filters 📌 Polarity Inversion 📌 Random gain 📌 Time masking 📌 Frequency masking Check out the video: https://www.youtube.com/watch?v=bm1cQfb_pLA&list=PL-wATfeyAMNoR4aqS-Fv0GRmS6bx5RtTW&index=3 submitted by /u/diabulusInMusica [link] [comments]
    [Research] Talks about Bayesian Optimization
    (1 min) [Research] Given the previous post I had (https://www.reddit.com/r/MachineLearning/comments/rhppgq/research_new_library_for_bayesian_optimisation/hot7x74/?context=3), I realised a lot are not sure what BO is. I put out a new playlist with two (short) videos for now about BO and its importance as promised: Video 1: https://www.youtube.com/watch?v=YwFiB7vOQD8&t=4s Video 2: https://www.youtube.com/watch?v=85dslFi7IB8 ​ I will increase the videos over time and detail how BO actually works. The first two-three videos are there to serve as examples conveying the class of problems this field covers. ​ If you are interested, feel free to check them out! View Poll submitted by /u/Ok_Can2425 [link] [comments]
    [D] Thoughts on IEEE Access as a journal?
    (3 min) I’m doing my PhD and tempted to publish my latest work there due to the faster review process when compared to other journals. My supervisor is encouraging me to publish there and get started on the next piece of work as soon as possible. He likes to get his PhD students finished quickly via article based phds and says that Access counts as much as any of the other mainstream journals (towards achieving the goal of article based phd). I want to get involved in industry research as soon as possible once graduating and am wondering if publishing in access will diminish my prospects vs publishing in the likes of NN or something like that. submitted by /u/DaBeastGeek [link] [comments]
    [D] Does PyTorch credit contributors?
    (1 min) Asked them on two occasions, no response. My PRs were closed and "landed" by a dev, so no standard Github contributor credit. If I were to contribute an optimizer, they'd merge it without attribution? submitted by /u/OverLordGoldDragon [link] [comments]
    [D] What is state of the art for audio generation?
    (1 min) I’m looking for suggestions for short sound effects, music, and / or speech but primarily concerned about sound effects. Is WaveGAN still dominant? submitted by /u/gameml [link] [comments]
  • Neural Networks, Deep Learning and Machine Learning

    Novel Viewpoint Tennis from Single Camera
    (0 min) submitted by /u/Odd_Temporary_9736 [link] [comments]
  • Reinforcement Learning

    Tips for remembering DRL algorithms for newbies
    (1 min) Hi all, I am studying the most popular DRL algorithms (NFQ, DQN, DDQN, Dueling DDQN, Reinforce, VPG, A3C, A2C, DDPG, TD3, SAC, PPO) and I was wondering a couple of things. A) What is the level of detail that you need to remember in order to use those algs efficiently? In other words, do you need to remember every bit of them? B) Do you have tricks and tips to remember them more easily, maybe by pointing me at some maps that highlight connections between them Thanks! submitted by /u/No_Possibility_7588 [link] [comments]
    Neural networks in Unity 3D
    (1 min) For a school project I need to create an agent that learns how to navigate a environment using reinforcement learning (DQN, policy gradient). To do this I need to be able to create a neural network which I will need to train. My question now is if there are any Neural Network libraries that I can use for creating and training neural networks. I know unity has the ML-agents toolkit but this does not seem to fit my need. Thanks! submitted by /u/IssueRepresentative5 [link] [comments]
    what is the best approach to POMDP environment?
    (2 min) Hello, I have some questions about the POMDP environment. First, I thought that in a POMDP environment, a policy-based method would be better than a value-based method. For example, Alice Grid World. Is it generally correct? Second, when training a limited view agent in a tabular environment, I expected the rppo agent to perform better than cnn-based ppo. But it didn't. I used this repository that was already implemented and saw slow learning based on this. When I trained a Starcraft II agent, there are really huge differences between those architecture. So I just wonder your opinions. Very Thanks! submitted by /u/Spiritual_Fig3632 [link] [comments]
    issue with modifying mujoco environment
    (1 min) I am trying to modify the inverteddoublependulum environment in mujoco and tried to changed the xml parameters in the gym envs folder, however when i load the environment in my script, it still loads back the same un-modified environment. I am creating a closed-loop kinematic chain by adding another cart with poles connecting to the original cart. Am i missing a step such as using some wrapper or having to register the environment? Thanks, its first first time doing this. submitted by /u/Lifeisunfair210 [link] [comments]
    Why do they call it 'greedy with respect to V'?
    (3 min) Hi everyone I just started reading 'Reinforcement Learning' by Richard S. Sutton and Andrew G. Barto and I hit a point where I'm not really sure why they describe it this way. (And it's not just this book, I see this expression everywhere) In chapter 3.6 they compute the optimal state value function by the following equation. Bellman optimality equation for state value It says afterwards how to choose the best policy with the optimal state value function If you have the optimal value function, v* then the actions that appear best after a one-step search will be optimal actions. Another way of saying this is that any policy that is greedy with respect to the optimal evaluation function v\* is an optimal policy. In chapter 4.2 where they describe policy improvement, they give us the following equation for updating the policy based on v_pi. Policy update based on v_pi The greedy policy takes the action that looks best in the short term—after one step of lookahead—according to v_pi ... The process of making a new policy that improves on an original policy, by making it greedy with respect to the value function of the original policy, is called policy improvement. To me, when you say "policy that is greedy with respect to v", it sounds like you just simply choose the action that moves the agent to the next state with the highest state value. But this cannot be true because you have to choose the action that maximizes the following function of v, not simply v itself. Function of v you wanna maximize And the equation above is q_pi(s, a). So shouldn't the correct phrase be "greedy with respect to q" and not v? I am wondering if I'm just overthinking things at this point... submitted by /u/peterpan1998 [link] [comments]
    Did anyone try Panda-Gym?
    (1 min) The acquisition of Mujoco makes Openai to remove the robotics from their repo. I had no choice but to find an alternative. Then I found https://github.com/qgallouedec/panda-gym which is built on PyBullet. submitted by /u/chezhek [link] [comments]

2022-01-12

  • cs.LG updates on arXiv.org

    AdvSim: Generating Safety-Critical Scenarios for Self-Driving Vehicles. (arXiv:2101.06549v3 [cs.RO] UPDATED)
    (2 min) As self-driving systems become better, simulating scenarios where the autonomy stack may fail becomes more important. Traditionally, those scenarios are generated for a few scenes with respect to the planning module that takes ground-truth actor states as input. This does not scale and cannot identify all possible autonomy failures, such as perception failures due to occlusion. In this paper, we propose AdvSim, an adversarial framework to generate safety-critical scenarios for any LiDAR-based autonomy system. Given an initial traffic scenario, AdvSim modifies the actors' trajectories in a physically plausible manner and updates the LiDAR sensor data to match the perturbed world. Importantly, by simulating directly from sensor data, we obtain adversarial scenarios that are safety-critical for the full autonomy stack. Our experiments show that our approach is general and can identify thousands of semantically meaningful safety-critical scenarios for a wide range of modern self-driving systems. Furthermore, we show that the robustness and safety of these systems can be further improved by training them with scenarios generated by AdvSim.
    Logically Sound Arguments for the Effectiveness of ML Safety Measures. (arXiv:2111.02649v2 [cs.LO] UPDATED)
    (0 min) We investigate the issues of achieving sufficient rigor in the arguments for the safety of machine learning functions. By considering the known weaknesses of DNN-based 2D bounding box detection algorithms, we sharpen the metric of imprecise pedestrian localization by associating it with the safety goal. The sharpening leads to introducing a conservative post-processor after the standard non-max-suppression as a counter-measure. We then propose a semi-formal assurance case for arguing the effectiveness of the post-processor, which is further translated into formal proof obligations for demonstrating the soundness of the arguments. Applying theorem proving not only discovers the need to introduce missing claims and mathematical concepts but also reveals the limitation of Dempster-Shafer's rules used in semi-formal argumentation.
    Cross-view Self-Supervised Learning on Heterogeneous Graph Neural Network via Bootstrapping. (arXiv:2201.03340v1 [cs.LG])
    (2 min) Heterogeneous graph neural networks can represent information of heterogeneous graphs with excellent ability. Recently, self-supervised learning manner is researched which learns the unique expression of a graph through a contrastive learning method. In the absence of labels, this learning methods show great potential. However, contrastive learning relies heavily on positive and negative pairs, and generating high-quality pairs from heterogeneous graphs is difficult. In this paper, in line with recent innovations in self-supervised learning, we introduce a that can generate good representations without generating large number of pairs. In addition, paying attention to the fact that heterogeneous graphs can be viewed from two perspectives in this process, high-level expressions in the graphs are captured and expressed. The proposed model showed state-of-the-art performance than other methods in various real world datasets.
    A Survey of Uncertainty in Deep Neural Networks. (arXiv:2107.03342v2 [cs.LG] UPDATED)
    (0 min) Due to their increasing spread, confidence in neural network predictions became more and more important. However, basic neural networks do not deliver certainty estimates or suffer from over or under confidence. Many researchers have been working on understanding and quantifying uncertainty in a neural network's prediction. As a result, different types and sources of uncertainty have been identified and a variety of approaches to measure and quantify uncertainty in neural networks have been proposed. This work gives a comprehensive overview of uncertainty estimation in neural networks, reviews recent advances in the field, highlights current challenges, and identifies potential research opportunities. It is intended to give anyone interested in uncertainty estimation in neural networks a broad overview and introduction, without presupposing prior knowledge in this field. A comprehensive introduction to the most crucial sources of uncertainty is given and their separation into reducible model uncertainty and not reducible data uncertainty is presented. The modeling of these uncertainties based on deterministic neural networks, Bayesian neural networks, ensemble of neural networks, and test-time data augmentation approaches is introduced and different branches of these fields as well as the latest developments are discussed. For a practical application, we discuss different measures of uncertainty, approaches for the calibration of neural networks and give an overview of existing baselines and implementations. Different examples from the wide spectrum of challenges in different fields give an idea of the needs and challenges regarding uncertainties in practical applications. Additionally, the practical limitations of current methods for mission- and safety-critical real world applications are discussed and an outlook on the next steps towards a broader usage of such methods is given.
    A Tutorial on Learning With Bayesian Networks. (arXiv:2002.00269v3 [cs.LG] UPDATED)
    (0 min) A Bayesian network is a graphical model that encodes probabilistic relationships among variables of interest. When used in conjunction with statistical techniques, the graphical model has several advantages for data analysis. One, because the model encodes dependencies among all variables, it readily handles situations where some data entries are missing. Two, a Bayesian network can be used to learn causal relationships, and hence can be used to gain understanding about a problem domain and to predict the consequences of intervention. Three, because the model has both a causal and probabilistic semantics, it is an ideal representation for combining prior knowledge (which often comes in causal form) and data. Four, Bayesian statistical methods in conjunction with Bayesian networks offer an efficient and principled approach for avoiding the overfitting of data. In this paper, we discuss methods for constructing Bayesian networks from prior knowledge and summarize Bayesian statistical methods for using data to improve these models. With regard to the latter task, we describe methods for learning both the parameters and structure of a Bayesian network, including techniques for learning with incomplete data. In addition, we relate Bayesian-network methods for learning to techniques for supervised and unsupervised learning. We illustrate the graphical-modeling approach using a real-world case study.
    AIDA: An Active Inference-based Design Agent for Audio Processing Algorithms. (arXiv:2112.13366v2 [eess.AS] UPDATED)
    (0 min) In this paper we present AIDA, which is an active inference-based agent that iteratively designs a personalized audio processing algorithm through situated interactions with a human client. The target application of AIDA is to propose on-the-spot the most interesting alternative values for the tuning parameters of a hearing aid (HA) algorithm, whenever a HA client is not satisfied with their HA performance. AIDA interprets searching for the "most interesting alternative" as an issue of optimal (acoustic) context-aware Bayesian trial design. In computational terms, AIDA is realized as an active inference-based agent with an Expected Free Energy criterion for trial design. This type of architecture is inspired by neuro-economic models on efficient (Bayesian) trial design in brains and implies that AIDA comprises generative probabilistic models for acoustic signals and user responses. We propose a novel generative model for acoustic signals as a sum of time-varying auto-regressive filters and a user response model based on a Gaussian Process Classifier. The full AIDA agent has been implemented in a factor graph for the generative model and all tasks (parameter learning, acoustic context classification, trial design, etc.) are realized by variational message passing on the factor graph. All verification and validation experiments and demonstrations are freely accessible at our GitHub repository.
    Learning to Simulate Self-Driven Particles System with Coordinated Policy Optimization. (arXiv:2110.13827v2 [cs.LG] UPDATED)
    (0 min) Self-Driven Particles (SDP) describe a category of multi-agent systems common in everyday life, such as flocking birds and traffic flows. In a SDP system, each agent pursues its own goal and constantly changes its cooperative or competitive behaviors with its nearby agents. Manually designing the controllers for such SDP system is time-consuming, while the resulting emergent behaviors are often not realistic nor generalizable. Thus the realistic simulation of SDP systems remains challenging. Reinforcement learning provides an appealing alternative for automating the development of the controller for SDP. However, previous multi-agent reinforcement learning (MARL) methods define the agents to be teammates or enemies before hand, which fail to capture the essence of SDP where the role of each agent varies to be cooperative or competitive even within one episode. To simulate SDP with MARL, a key challenge is to coordinate agents' behaviors while still maximizing individual objectives. Taking traffic simulation as the testing bed, in this work we develop a novel MARL method called Coordinated Policy Optimization (CoPO), which incorporates social psychology principle to learn neural controller for SDP. Experiments show that the proposed method can achieve superior performance compared to MARL baselines in various metrics. Noticeably the trained vehicles exhibit complex and diverse social behaviors that improve performance and safety of the population as a whole. Demo video and source code are available at: https://decisionforce.github.io/CoPO/
    Communication-Efficient Federated Learning via Predictive Coding. (arXiv:2108.00918v2 [cs.DC] UPDATED)
    (0 min) Federated learning can enable remote workers to collaboratively train a shared machine learning model while allowing training data to be kept locally. In the use case of wireless mobile devices, the communication overhead is a critical bottleneck due to limited power and bandwidth. Prior work has utilized various data compression tools such as quantization and sparsification to reduce the overhead. In this paper, we propose a predictive coding based compression scheme for federated learning. The scheme has shared prediction functions among all devices and allows each worker to transmit a compressed residual vector derived from the reference. In each communication round, we select the predictor and quantizer based on the rate-distortion cost, and further reduce the redundancy with entropy coding. Extensive simulations reveal that the communication cost can be reduced up to 99% with even better learning performance when compared with other baseline methods.
    L\'evy Induced Stochastic Differential Equation Equipped with Neural Network for Time Series Forecasting. (arXiv:2111.13164v2 [cs.LG] UPDATED)
    (0 min) With the fast development of modern deep learning techniques, the study of dynamic systems and neural networks is increasingly benefiting each other in a lot of different ways. Since uncertainties often arise in real world observations, SDEs (stochastic differential equations) come to play an important role. To be more specific, in this paper, we use a collection of SDEs equipped with neural networks to predict long-term trend of noisy time series which has big jump properties and high probability distribution shift. Our contributions are, first, we explored SDEs driven by $\alpha$-stable L\'evy motion to model the time series data and solved the problem through neural network approximation. Second, we theoretically proved the convergence of the model and obtained the convergence rate. Finally, we illustrated our method by applying it to stock marketing time series prediction and found the convergence order of error.
    Neural-PDE: A RNN based neural network for solving time dependent PDEs. (arXiv:2009.03892v3 [math.NA] UPDATED)
    (0 min) Partial differential equations (PDEs) play a crucial role in studying a vast number of problems in science and engineering. Numerically solving nonlinear and/or high-dimensional PDEs is often a challenging task. Inspired by the traditional finite difference and finite elements methods and emerging advancements in machine learning, we propose a sequence deep learning framework called Neural-PDE, which allows to automatically learn governing rules of any time-dependent PDE system from existing data by using a bidirectional LSTM encoder, and predict the next n time steps data. One critical feature of our proposed framework is that the Neural-PDE is able to simultaneously learn and simulate the multiscale variables.We test the Neural-PDE by a range of examples from one-dimensional PDEs to a high-dimensional and nonlinear complex fluids model. The results show that the Neural-PDE is capable of learning the initial conditions, boundary conditions and differential operators without the knowledge of the specific form of a PDE system.In our experiments the Neural-PDE can efficiently extract the dynamics within 20 epochs training, and produces accurate predictions. Furthermore, unlike the traditional machine learning approaches in learning PDE such as CNN and MLP which require vast parameters for model precision, Neural-PDE shares parameters across all time steps, thus considerably reduces the computational complexity and leads to a fast learning algorithm.
    A semigroup method for high dimensional elliptic PDEs and eigenvalue problems based on neural networks. (arXiv:2105.03480v3 [math.NA] UPDATED)
    (0 min) In this paper, we propose a semigroup method for solving high-dimensional elliptic partial differential equations (PDEs) and the associated eigenvalue problems based on neural networks. For the PDE problems, we reformulate the original equations as variational problems with the help of semigroup operators and then solve the variational problems with neural network (NN) parameterization. The main advantages are that no mixed second-order derivative computation is needed during the stochastic gradient descent training and that the boundary conditions are taken into account automatically by the semigroup operator. Unlike popular methods like PINN \cite{raissi2019physics} and Deep Ritz \cite{weinan2018deep} where the Dirichlet boundary condition is enforced solely through penalty functions and thus changes the true solution, the proposed method is able to address the boundary conditions without penalty functions and it gives the correct true solution even when penalty functions are added, thanks to the semigroup operator. For eigenvalue problems, a primal-dual method is proposed, efficiently resolving the constraint with a simple scalar dual variable and resulting in a faster algorithm compared with the BSDE solver \cite{han2020solving} in certain problems such as the eigenvalue problem associated with the linear Schr\"odinger operator. Numerical results are provided to demonstrate the performance of the proposed methods.
    MARINA: Faster Non-Convex Distributed Learning with Compression. (arXiv:2102.07845v3 [cs.LG] UPDATED)
    (0 min) We develop and analyze MARINA: a new communication efficient method for non-convex distributed learning over heterogeneous datasets. MARINA employs a novel communication compression strategy based on the compression of gradient differences that is reminiscent of but different from the strategy employed in the DIANA method of Mishchenko et al. (2019). Unlike virtually all competing distributed first-order methods, including DIANA, ours is based on a carefully designed biased gradient estimator, which is the key to its superior theoretical and practical performance. The communication complexity bounds we prove for MARINA are evidently better than those of all previous first-order methods. Further, we develop and analyze two variants of MARINA: VR-MARINA and PP-MARINA. The first method is designed for the case when the local loss functions owned by clients are either of a finite sum or of an expectation form, and the second method allows for a partial participation of clients -- a feature important in federated learning. All our methods are superior to previous state-of-the-art methods in terms of oracle/communication complexity. Finally, we provide a convergence analysis of all methods for problems satisfying the Polyak-Lojasiewicz condition.
    Swin Transformer for Fast MRI. (arXiv:2201.03230v1 [eess.IV])
    (0 min) Magnetic resonance imaging (MRI) is an important non-invasive clinical tool that can produce high-resolution and reproducible images. However, a long scanning time is required for high-quality MR images, which leads to exhaustion and discomfort of patients, inducing more artefacts due to voluntary movements of the patients and involuntary physiological movements. To accelerate the scanning process, methods by k-space undersampling and deep learning based reconstruction have been popularised. This work introduced SwinMR, a novel Swin transformer based method for fast MRI reconstruction. The whole network consisted of an input module (IM), a feature extraction module (FEM) and an output module (OM). The IM and OM were 2D convolutional layers and the FEM was composed of a cascaded of residual Swin transformer blocks (RSTBs) and 2D convolutional layers. The RSTB consisted of a series of Swin transformer layers (STLs). The shifted windows multi-head self-attention (W-MSA/SW-MSA) of STL was performed in shifted windows rather than the multi-head self-attention (MSA) of the original transformer in the whole image space. A novel multi-channel loss was proposed by using the sensitivity maps, which was proved to reserve more textures and details. We performed a series of comparative studies and ablation studies in the Calgary-Campinas public brain MR dataset and conducted a downstream segmentation experiment in the Multi-modal Brain Tumour Segmentation Challenge 2017 dataset. The results demonstrate our SwinMR achieved high-quality reconstruction compared with other benchmark methods, and it shows great robustness with different undersampling masks, under noise interruption and on different datasets. The code is publicly available at https://github.com/ayanglab/SwinMR.
    IoTGAN: GAN Powered Camouflage Against Machine Learning Based IoT Device Identification. (arXiv:2201.03281v1 [cs.CR])
    (0 min) With the proliferation of IoT devices, researchers have developed a variety of IoT device identification methods with the assistance of machine learning. Nevertheless, the security of these identification methods mostly depends on collected training data. In this research, we propose a novel attack strategy named IoTGAN to manipulate an IoT device's traffic such that it can evade machine learning based IoT device identification. In the development of IoTGAN, we have two major technical challenges: (i) How to obtain the discriminative model in a black-box setting, and (ii) How to add perturbations to IoT traffic through the manipulative model, so as to evade the identification while not influencing the functionality of IoT devices. To address these challenges, a neural network based substitute model is used to fit the target model in black-box settings, it works as a discriminative model in IoTGAN. A manipulative model is trained to add adversarial perturbations into the IoT device's traffic to evade the substitute model. Experimental results show that IoTGAN can successfully achieve the attack goals. We also develop efficient countermeasures to protect machine learning based IoT device identification from been undermined by IoTGAN.
    Surrogate-assisted performance prediction for data-driven knowledge discovery algorithms: application to evolutionary modeling of clinical pathways. (arXiv:2004.01123v2 [cs.LG] UPDATED)
    (0 min) The paper proposes and investigates an approach for surrogate-assisted performance prediction of data-driven knowledge discovery algorithms. The approach is based on the identification of surrogate models for prediction of the target algorithm's quality and performance. The proposed approach was implemented and investigated as applied to an evolutionary algorithm for discovering clusters of interpretable clinical pathways in electronic health records of patients with acute coronary syndrome. Several clustering metrics and execution time were used as the target quality and performance metrics respectively. An analytical software prototype based on the proposed approach for the prediction of algorithm characteristics and feature analysis was developed to provide a more interpretable prediction of the target algorithm's performance and quality that can be further used for parameter tuning.
    Automated Lay Language Summarization of Biomedical Scientific Reviews. (arXiv:2012.12573v3 [cs.CL] UPDATED)
    (0 min) Health literacy has emerged as a crucial factor in making appropriate health decisions and ensuring treatment outcomes. However, medical jargon and the complex structure of professional language in this domain make health information especially hard to interpret. Thus, there is an urgent unmet need for automated methods to enhance the accessibility of the biomedical literature to the general population. This problem can be framed as a type of translation problem between the language of healthcare professionals, and that of the general public. In this paper, we introduce the novel task of automated generation of lay language summaries of biomedical scientific reviews, and construct a dataset to support the development and evaluation of automated methods through which to enhance the accessibility of the biomedical literature. We conduct analyses of the various challenges in solving this task, including not only summarization of the key points but also explanation of background knowledge and simplification of professional language. We experiment with state-of-the-art summarization models as well as several data augmentation techniques, and evaluate their performance using both automated metrics and human assessment. Results indicate that automatically generated summaries produced using contemporary neural architectures can achieve promising quality and readability as compared with reference summaries developed for the lay public by experts (best ROUGE-L of 50.24 and Flesch-Kincaid readability score of 13.30). We also discuss the limitations of the current attempt, providing insights and directions for future work.
    Trade-off on Sim2Real Learning: Real-world Learning Faster than Simulations. (arXiv:2007.10675v4 [cs.RO] UPDATED)
    (0 min) Deep Reinforcement Learning (DRL) experiments are commonly performed in simulated environments due to the tremendous training sample demands from deep neural networks. In contrast, model-based Bayesian Learning allows a robot to learn good policies within a few trials in the real world. Although it takes fewer iterations, Bayesian methods pay a relatively higher computational cost per trial, and the advantage of such methods is strongly tied to dimensionality and noise. In here, we compare a Deep Bayesian Learning algorithm with a model-free DRL algorithm while analyzing our results collected from both simulations and real-world experiments. While considering Sim and Real learning, our experiments show that the sample-efficient Deep Bayesian RL performance is better than DRL even when computation time (as opposed to number of iterations) is taken in consideration. Additionally, the difference in computation time between Deep Bayesian RL performed in simulation and in experiments point to a viable path to traverse the reality gap. We also show that a mix between Sim and Real does not outperform a purely Real approach, pointing to the possibility that reality can provide the best prior knowledge to a Bayesian Learning. Roboticists design and build robots every day, and our results show that a higher learning efficiency in the real-world will shorten the time between design and deployment by skipping simulations.
    Toward a Fairness-Aware Scoring System for Algorithmic Decision-Making. (arXiv:2109.10053v3 [cs.LG] UPDATED)
    (0 min) Scoring systems, as a type of predictive model, have significant advantages in interpretability and transparency and facilitate quick decision-making. As such, scoring systems have been extensively used in a wide variety of industries such as healthcare and criminal justice. However, the fairness issues in these models have long been criticized, and the use of big data and machine learning algorithms in the construction of scoring systems heightens this concern. In this paper, we propose a general framework to create fairness-aware, data-driven scoring systems. First, we develop a social welfare function that incorporates both efficiency and group fairness. Then, we transform the social welfare maximization problem into the risk minimization task in machine learning, and derive a fairness-aware scoring system with the help of mixed integer programming. Lastly, several theoretical bounds are derived for providing parameter selection suggestions. Our proposed framework provides a suitable solution to address group fairness concerns in the development of scoring systems. It enables policymakers to set and customize their desired fairness requirements as well as other application-specific constraints. We test the proposed algorithm with several empirical data sets. Experimental evidence supports the effectiveness of the proposed scoring system in achieving the optimal welfare of stakeholders and in balancing the needs for interpretability, fairness, and efficiency.
    Bridging the gap to real-world for network intrusion detection systems with data-centric approach. (arXiv:2110.13655v2 [cs.CR] UPDATED)
    (0 min) Most research using machine learning (ML) for network intrusion detection systems (NIDS) uses well-established datasets such as KDD-CUP99, NSL-KDD, UNSW-NB15, and CICIDS-2017. In this context, the possibilities of machine learning techniques are explored, aiming for metrics improvements compared to the published baselines (model-centric approach). However, those datasets present some limitations as aging that make it unfeasible to transpose those ML-based solutions to real-world applications. This paper presents a systematic data-centric approach to address the current limitations of NIDS research, specifically the datasets. This approach generates NIDS datasets composed of the most recent network traffic and attacks, with the labeling process integrated by design.
    GBRS: An Unified Model of Pawlak Rough Set and Neighborhood Rough Set. (arXiv:2201.03349v1 [cs.AI])
    (0 min) Pawlak rough set and neighborhood rough set are the two most common rough set theoretical models. Pawlawk can use equivalence classes to represent knowledge, but it cannot process continuous data; neighborhood rough sets can process continuous data, but it loses the ability of using equivalence classes to represent knowledge. To this end, this paper presents a granular-ball rough set based on the granlar-ball computing. The granular-ball rough set can simultaneously represent Pawlak rough sets, and the neighborhood rough set, so as to realize the unified representation of the two. This makes the granular-ball rough set not only can deal with continuous data, but also can use equivalence classes for knowledge representation. In addition, we propose an implementation algorithms of granular-ball rough sets. The experimental resuts on benchmark datasets demonstrate that, due to the combination of the robustness and adaptability of the granular-ball computing, the learning accuracy of the granular-ball rough set has been greatly improved compared with the Pawlak rough set and the traditional neighborhood rough set. The granular-ball rough set also outperforms nine popular or the state-of-the-art feature selection methods.
    DiffSDFSim: Differentiable Rigid-Body Dynamics With Implicit Shapes. (arXiv:2111.15318v2 [cs.CV] UPDATED)
    (0 min) Differentiable physics is a powerful tool in computer vision and robotics for scene understanding and reasoning about interactions. Existing approaches have frequently been limited to objects with simple shape or shapes that are known in advance. In this paper, we propose a novel approach to differentiable physics with frictional contacts which represents object shapes implicitly using signed distance fields (SDFs). Our simulation supports contact point calculation even when the involved shapes are nonconvex. Moreover, we propose ways for differentiating the dynamics for the object shape to facilitate shape optimization using gradient-based methods. In our experiments, we demonstrate that our approach allows for model-based inference of physical parameters such as friction coefficients, mass, forces or shape parameters from trajectory and depth image observations in several challenging synthetic scenarios and a real image sequence.
    Optimal radial basis for density-based atomic representations. (arXiv:2105.08717v2 [stat.ML] UPDATED)
    (0 min) The input of almost every machine learning algorithm targeting the properties of matter at the atomic scale involves a transformation of the list of Cartesian atomic coordinates into a more symmetric representation. Many of the most popular representations can be seen as an expansion of the symmetrized correlations of the atom density, and differ mainly by the choice of basis. Considerable effort has been dedicated to the optimization of the basis set, typically driven by heuristic considerations on the behavior of the regression target. Here we take a different, unsupervised viewpoint, aiming to determine the basis that encodes in the most compact way possible the structural information that is relevant for the dataset at hand. For each training dataset and number of basis functions, one can determine a unique basis that is optimal in this sense, and can be computed at no additional cost with respect to the primitive basis by approximating it with splines. We demonstrate that this construction yields representations that are accurate and computationally efficient, particularly when constructing representations that correspond to high-body order correlations. We present examples that involve both molecular and condensed-phase machine-learning models.
    Attention is All You Need? Good Embeddings with Statistics are enough:Large Scale Audio Understanding without Transformers/ Convolutions/ BERTs/ Mixers/ Attention/ RNNs or ..... (arXiv:2110.03183v3 [cs.SD] UPDATED)
    (0 min) This paper presents a way of doing large scale audio understanding without traditional state of the art neural architectures. Ever since the introduction of deep learning for understanding audio signals in the past decade, convolutional architectures have been able to achieve state of the art results surpassing traditional hand-crafted features. In the recent past, there has been a similar shift away from traditional convolutional and recurrent neural networks towards purely end-to-end Transformer architectures. We, in this work, explore an approach, based on Bag-of-Words model. Our approach does not have any convolutions, recurrence, attention, transformers or other approaches such as BERT. We utilize micro and macro level clustered vanilla embeddings, and use a MLP head for classification. We only use feed-forward encoder-decoder models to get the bottlenecks of spectral envelops, spectral patches and slices as well as multi-resolution spectra. A classification head (a feed-forward layer), similar to the approach in SimCLR is trained on a learned representation. Using simple codes learned on latent representations, we show how we surpass traditional convolutional neural network architectures, and come strikingly close to outperforming powerful Transformer architectures. This work hopefully would pave way for exciting advancements in the field of representation learning without massive, end-to-end neural architectures.
    Unsupervised Image Decomposition with Phase-Correlation Networks. (arXiv:2110.03473v3 [cs.CV] UPDATED)
    (0 min) The ability to decompose scenes into their object components is a desired property for autonomous agents, allowing them to reason and act in their surroundings. Recently, different methods have been proposed to learn object-centric representations from data in an unsupervised manner. These methods often rely on latent representations learned by deep neural networks, hence requiring high computational costs and large amounts of curated data. Such models are also difficult to interpret. To address these challenges, we propose the Phase-Correlation Decomposition Network (PCDNet), a novel model that decomposes a scene into its object components, which are represented as transformed versions of a set of learned object prototypes. The core building block in PCDNet is the Phase-Correlation Cell (PC Cell), which exploits the frequency-domain representation of the images in order to estimate the transformation between an object prototype and its transformed version in the image. In our experiments, we show how PCDNet outperforms state-of-the-art methods for unsupervised object discovery and segmentation on simple benchmark datasets and on more challenging data, while using a small number of learnable parameters and being fully interpretable.
    Continual Learning via Bit-Level Information Preserving. (arXiv:2105.04444v2 [cs.LG] UPDATED)
    (0 min) Continual learning tackles the setting of learning different tasks sequentially. Despite the lots of previous solutions, most of them still suffer significant forgetting or expensive memory cost. In this work, targeted at these problems, we first study the continual learning process through the lens of information theory and observe that forgetting of a model stems from the loss of \emph{information gain} on its parameters from the previous tasks when learning a new task. From this viewpoint, we then propose a novel continual learning approach called Bit-Level Information Preserving (BLIP) that preserves the information gain on model parameters through updating the parameters at the bit level, which can be conveniently implemented with parameter quantization. More specifically, BLIP first trains a neural network with weight quantization on the new incoming task and then estimates information gain on each parameter provided by the task data to determine the bits to be frozen to prevent forgetting. We conduct extensive experiments ranging from classification tasks to reinforcement learning tasks, and the results show that our method produces better or on par results comparing to previous state-of-the-arts. Indeed, BLIP achieves close to zero forgetting while only requiring constant memory overheads throughout continual learning.
    Retiring Adult: New Datasets for Fair Machine Learning. (arXiv:2108.04884v3 [cs.LG] UPDATED)
    (0 min) Although the fairness community has recognized the importance of data, researchers in the area primarily rely on UCI Adult when it comes to tabular data. Derived from a 1994 US Census survey, this dataset has appeared in hundreds of research papers where it served as the basis for the development and comparison of many algorithmic fairness interventions. We reconstruct a superset of the UCI Adult data from available US Census sources and reveal idiosyncrasies of the UCI Adult dataset that limit its external validity. Our primary contribution is a suite of new datasets derived from US Census surveys that extend the existing data ecosystem for research on fair machine learning. We create prediction tasks relating to income, employment, health, transportation, and housing. The data span multiple years and all states of the United States, allowing researchers to study temporal shift and geographic variation. We highlight a broad initial sweep of new empirical insights relating to trade-offs between fairness criteria, performance of algorithmic interventions, and the role of distribution shift based on our new datasets. Our findings inform ongoing debates, challenge some existing narratives, and point to future research directions. Our datasets are available at https://github.com/zykls/folktables.
    Sustainable AI: Environmental Implications, Challenges and Opportunities. (arXiv:2111.00364v2 [cs.LG] UPDATED)
    (0 min) This paper explores the environmental impact of the super-linear growth trends for AI from a holistic perspective, spanning Data, Algorithms, and System Hardware. We characterize the carbon footprint of AI computing by examining the model development cycle across industry-scale machine learning use cases and, at the same time, considering the life cycle of system hardware. Taking a step further, we capture the operational and manufacturing carbon footprint of AI computing and present an end-to-end analysis for what and how hardware-software design and at-scale optimization can help reduce the overall carbon footprint of AI. Based on the industry experience and lessons learned, we share the key challenges and chart out important development directions across the many dimensions of AI. We hope the key messages and insights presented in this paper can inspire the community to advance the field of AI in an environmentally-responsible manner.
    Grounded Relational Inference: Domain Knowledge Driven Explainable Autonomous Driving. (arXiv:2102.11905v2 [cs.AI] UPDATED)
    (0 min) Explainability is essential for autonomous vehicles and other robotics systems interacting with humans and other objects during operation. Humans need to understand and anticipate the actions taken by the machines for trustful and safe cooperation. In this work, we aim to develop an explainable model that generates explanations consistent with both human domain knowledge and the model's inherent causal relation. In particular, we focus on an essential building block of autonomous driving, multi-agent interaction modeling. We propose Grounded Relational Inference (GRI). It models an interactive system's underlying dynamics by inferring an interaction graph representing the agents' relations. We ensure a semantically meaningful interaction graph by grounding the relational latent space into semantic interactive behaviors defined with expert domain knowledge. We demonstrate that it can model interactive traffic scenarios under both simulation and real-world settings, and generate semantic graphs explaining the vehicle's behavior by their interactions.
    A Deep Learning-Accelerated Data Assimilation and Forecasting Workflow for Commercial-Scale Geologic Carbon Storage. (arXiv:2105.09468v2 [physics.geo-ph] UPDATED)
    (0 min) Fast assimilation of monitoring data to update forecasts of pressure buildup and carbon dioxide (CO2) plume migration under geologic uncertainties is a challenging problem in geologic carbon storage. The high computational cost of data assimilation with a high-dimensional parameter space impedes fast decision-making for commercial-scale reservoir management. We propose to leverage physical understandings of porous medium flow behavior with deep learning techniques to develop a fast history matching-reservoir response forecasting workflow. Applying an Ensemble Smoother Multiple Data Assimilation framework, the workflow updates geologic properties and predicts reservoir performance with quantified uncertainty from pressure history and CO2 plumes interpreted through seismic inversion. As the most computationally expensive component in such a workflow is reservoir simulation, we developed surrogate models to predict dynamic pressure and CO2 plume extents under multi-well injection. The surrogate models employ deep convolutional neural networks, specifically, a wide residual network and a residual U-Net. The workflow is validated against a flat three-dimensional reservoir model representative of a clastic shelf depositional environment. Intelligent treatments are applied to bridge between quantities in a true-3D reservoir model and those in a single-layer reservoir model. The workflow can complete history matching and reservoir forecasting with uncertainty quantification in less than one hour on a mainstream personal workstation.
    Comparison of Representation Learning Techniques for Tracking in time resolved 3D Ultrasound. (arXiv:2201.03319v1 [eess.IV])
    (0 min) 3D ultrasound (3DUS) becomes more interesting for target tracking in radiation therapy due to its capability to provide volumetric images in real-time without using ionizing radiation. It is potentially usable for tracking without using fiducials. For this, a method for learning meaningful representations would be useful to recognize anatomical structures in different time frames in representation space (r-space). In this study, 3DUS patches are reduced into a 128-dimensional r-space using conventional autoencoder, variational autoencoder and sliced-wasserstein autoencoder. In the r-space, the capability of separating different ultrasound patches as well as recognizing similar patches is investigated and compared based on a dataset of liver images. Two metrics to evaluate the tracking capability in the r-space are proposed. It is shown that ultrasound patches with different anatomical structures can be distinguished and sets of similar patches can be clustered in r-space. The results indicate that the investigated autoencoders have different levels of usability for target tracking in 3DUS.
    Individual Privacy Accounting via a Renyi Filter. (arXiv:2008.11193v4 [cs.CR] UPDATED)
    (0 min) We consider a sequential setting in which a single dataset of individuals is used to perform adaptively-chosen analyses, while ensuring that the differential privacy loss of each participant does not exceed a pre-specified privacy budget. The standard approach to this problem relies on bounding a worst-case estimate of the privacy loss over all individuals and all possible values of their data, for every single analysis. Yet, in many scenarios this approach is overly conservative, especially for "typical" data points which incur little privacy loss by participation in most of the analyses. In this work, we give a method for tighter privacy loss accounting based on the value of a personalized privacy loss estimate for each individual in each analysis. To implement the accounting method we design a filter for R\'enyi differential privacy. A filter is a tool that ensures that the privacy parameter of a composed sequence of algorithms with adaptively-chosen privacy parameters does not exceed a pre-specified budget. Our filter is simpler and tighter than the known filter for $(\epsilon,\delta)$-differential privacy by Rogers et al. We apply our results to the analysis of noisy gradient descent and show that personalized accounting can be practical, easy to implement, and can only make the privacy-utility tradeoff tighter.
    Gait Recognition Based on Deep Learning: A Survey. (arXiv:2201.03323v1 [cs.CV])
    (0 min) In general, biometry-based control systems may not rely on individual expected behavior or cooperation to operate appropriately. Instead, such systems should be aware of malicious procedures for unauthorized access attempts. Some works available in the literature suggest addressing the problem through gait recognition approaches. Such methods aim at identifying human beings through intrinsic perceptible features, despite dressed clothes or accessories. Although the issue denotes a relatively long-time challenge, most of the techniques developed to handle the problem present several drawbacks related to feature extraction and low classification rates, among other issues. However, deep learning-based approaches recently emerged as a robust set of tools to deal with virtually any image and computer-vision related problem, providing paramount results for gait recognition as well. Therefore, this work provides a surveyed compilation of recent works regarding biometric detection through gait recognition with a focus on deep learning approaches, emphasizing their benefits, and exposing their weaknesses. Besides, it also presents categorized and characterized descriptions of the datasets, approaches, and architectures employed to tackle associated constraints.
    Studying Hadronization by Machine Learning Techniques. (arXiv:2111.15655v2 [hep-ph] UPDATED)
    (0 min) Hadronization is a non-perturbative process, which theoretical description can not be deduced from first principles. Modeling hadron formation requires several assumptions and various phenomenological approaches. Utilizing state-of-the-art Computer Vision and Deep Learning algorithms, it is eventually possible to train neural networks to learn non-linear and non-perturbative features of the physical processes. In this study, results of two ResNet networks are presented by investigating global and kinematical quantities, indeed jet- and event-shape variables. The widely used Lund string fragmentation model is applied as a baseline in $\sqrt{s}= 7$ TeV proton-proton collisions to predict the most relevant observables at further LHC energies.
    Continual Learning of Long Topic Sequences in Neural Information Retrieval. (arXiv:2201.03356v1 [cs.IR])
    (0 min) In information retrieval (IR) systems, trends and users' interests may change over time, altering either the distribution of requests or contents to be recommended. Since neural ranking approaches heavily depend on the training data, it is crucial to understand the transfer capacity of recent IR approaches to address new domains in the long term. In this paper, we first propose a dataset based upon the MSMarco corpus aiming at modeling a long stream of topics as well as IR property-driven controlled settings. We then in-depth analyze the ability of recent neural IR models while continually learning those streams. Our empirical study highlights in which particular cases catastrophic forgetting occurs (e.g., level of similarity between tasks, peculiarities on text length, and ways of learning models) to provide future directions in terms of model design.
    SUPER-ADAM: Faster and Universal Framework of Adaptive Gradients. (arXiv:2106.08208v8 [math.OC] UPDATED)
    (0 min) Adaptive gradient methods have shown excellent performances for solving many machine learning problems. Although multiple adaptive gradient methods were recently studied, they mainly focus on either empirical or theoretical aspects and also only work for specific problems by using some specific adaptive learning rates. Thus, it is desired to design a universal framework for practical algorithms of adaptive gradients with theoretical guarantee to solve general problems. To fill this gap, we propose a faster and universal framework of adaptive gradients (i.e., SUPER-ADAM) by introducing a universal adaptive matrix that includes most existing adaptive gradient forms. Moreover, our framework can flexibly integrate the momentum and variance reduced techniques. In particular, our novel framework provides the convergence analysis support for adaptive gradient methods under the nonconvex setting. In theoretical analysis, we prove that our SUPER-ADAM algorithm can achieve the best known gradient (i.e., stochastic first-order oracle (SFO)) complexity of $\tilde{O}(\epsilon^{-3})$ for finding an $\epsilon$-stationary point of nonconvex optimization, which matches the lower bound for stochastic smooth nonconvex optimization. In numerical experiments, we employ various deep learning tasks to validate that our algorithm consistently outperforms the existing adaptive algorithms. Code is available at https://github.com/LIJUNYI95/SuperAdam
    GraphFM: Graph Factorization Machines for Feature Interaction Modeling. (arXiv:2105.11866v3 [cs.LG] UPDATED)
    (0 min) Factorization machine (FM) is a prevalent approach to modeling pairwise (second-order) feature interactions when dealing with high-dimensional sparse data. However, on the one hand, FM fails to capture higher-order feature interactions suffering from combinatorial expansion, on the other hand, taking into account interaction between every pair of features may introduce noise and degrade prediction accuracy. To solve the problems, we propose a novel approach Graph Factorization Machine (GraphFM) by naturally representing features in the graph structure. In particular, a novel mechanism is designed to select the beneficial feature interactions and formulate them as edges between features. Then our proposed model which integrates the interaction function of FM into the feature aggregation strategy of Graph Neural Network (GNN), can model arbitrary-order feature interactions on the graph-structured features by stacking layers. Experimental results on several real-world datasets has demonstrated the rationality and effectiveness of our proposed approach.
    GDI: Rethinking What Makes Reinforcement Learning Different From Supervised Learning. (arXiv:2106.06232v6 [cs.LG] UPDATED)
    (0 min) Deep Q Network (DQN) firstly kicked the door of deep reinforcement learning (DRL) via combining deep learning (DL) with reinforcement learning (RL), which has noticed that the distribution of the acquired data would change during the training process. DQN found this property might cause instability for training, so it proposed effective methods to handle the downside of the property. Instead of focusing on the unfavourable aspects, we find it critical for RL to ease the gap between the estimated data distribution and the ground truth data distribution while supervised learning (SL) fails to do so. From this new perspective, we extend the basic paradigm of RL called the Generalized Policy Iteration (GPI) into a more generalized version, which is called the Generalized Data Distribution Iteration (GDI). We see massive RL algorithms and techniques can be unified into the GDI paradigm, which can be considered as one of the special cases of GDI. We provide theoretical proof of why GDI is better than GPI and how it works. Several practical algorithms based on GDI have been proposed to verify the effectiveness and extensiveness of it. Empirical experiments prove our state-of-the-art (SOTA) performance on Arcade Learning Environment (ALE), wherein our algorithm has achieved 9620.98% mean human normalized score (HNS), 1146.39% median HNS and 22 human world record breakthroughs (HWRB) using only 200M training frames. Our work aims to lead the RL research to step into the journey of conquering the human world records and seek real superhuman agents on both performance and efficiency.
    Refinement of Hottopixx Method for Nonnegative Matrix Factorization Under Noisy Separability. (arXiv:2109.02863v2 [cs.LG] UPDATED)
    (0 min) Hottopixx, proposed by Bittorf et al. at NIPS 2012, is an algorithm for solving nonnegative matrix factorization (NMF) problems under the separability assumption. Separable NMFs have important applications, such as topic extraction from documents and unmixing of hyperspectral images. In such applications, the robustness of the algorithm to noise is the key to the success. Hottopixx has been shown to be robust to noise, and its robustness can be further enhanced through postprocessing. However, there is a drawback. Hottopixx and its postprocessing require us to estimate the noise level involved in the matrix we want to factorize before running, since they use it as part of the input data. The noise-level estimation is not an easy task. In this paper, we overcome this drawback. We present a refinement of Hottopixx and its postprocessing that runs without prior knowledge of the noise level. We show that the refinement has almost the same robustness to noise as the original algorithm.
    GTAdam: Gradient Tracking with Adaptive Momentum for Distributed Online Optimization. (arXiv:2009.01745v2 [math.OC] UPDATED)
    (0 min) This paper deals with a network of computing agents aiming to solve an online optimization problem in a distributed fashion, i.e., by means of local computation and communication, without any central coordinator. We propose the gradient tracking with adaptive momentum estimation (GTAdam) distributed algorithm, which combines a gradient tracking mechanism with first and second order momentum estimates of the gradient. The algorithm is analyzed in the online setting for strongly convex cost functions with Lipschitz continuous gradients. We provide an upper bound for the dynamic regret given by a term related to the initial conditions, and another term related to the temporal variations of the objective functions. Moreover, a linear convergence rate is guaranteed in the static set-up. The algorithm is tested on a time-varying classification problem, on a (moving) target localization problem and in a stochastic optimization setup from image classification. In these numerical experiments from multi-agent learning, GTAdam outperforms state-of-the-art distributed optimization methods.
    Stability Analysis of Unfolded WMMSE for Power Allocation. (arXiv:2110.07471v2 [eess.SP] UPDATED)
    (0 min) Power allocation is one of the fundamental problems in wireless networks and a wide variety of algorithms address this problem from different perspectives. A common element among these algorithms is that they rely on an estimation of the channel state, which may be inaccurate on account of hardware defects, noisy feedback systems, and environmental and adversarial disturbances. Therefore, it is essential that the output power allocation of these algorithms is stable with respect to input perturbations, to the extent that the variations in the output are bounded for bounded variations in the input. In this paper, we focus on UWMMSE -- a modern algorithm leveraging graph neural networks --, and illustrate its stability to additive input perturbations of bounded energy through both theoretical analysis and empirical validation.
    Understanding occupants' behaviour, engagement, emotion, and comfort indoors with heterogeneous sensors and wearables. (arXiv:2105.06637v2 [cs.HC] UPDATED)
    (0 min) We conducted a field study at a K-12 private school in the suburbs of Melbourne, Australia. The data capture contained two elements: First, a 5-month longitudinal field study In-Gauge using two outdoor weather stations, as well as indoor weather stations in 17 classrooms and temperature sensors on the vents of occupant-controlled room air-conditioners; these were collated into individual datasets for each classroom at a 5-minute logging frequency, including additional data on occupant presence. The dataset was used to derive predictive models of how occupants operate room air-conditioning units. Second, we tracked 23 students and 6 teachers in a 4-week cross-sectional study En-Gage, using wearable sensors to log physiological data, as well as daily surveys to query the occupants' thermal comfort, learning engagement, emotions and seating behaviours. Overall, the combined dataset could be used to analyse the relationships between indoor/outdoor climates and students' behaviours/mental states on campus, which provide opportunities for the future design of intelligent feedback systems to benefit both students and staff.
    Federated learning and next generation wireless communications: A survey on bidirectional relationship. (arXiv:2110.07649v2 [eess.SP] UPDATED)
    (0 min) In order to meet the extremely heterogeneous requirements of the next generation wireless communication networks, research community is increasingly dependent on using machine learning solutions for real-time decision-making and radio resource management. Traditional machine learning employs fully centralized architecture in which the entire training data is collected at one node e.g., cloud server, that significantly increases the communication overheads and also raises severe privacy concerns. Towards this end, a distributed machine learning paradigm termed as Federated learning (FL) has been proposed recently. In FL, each participating edge device trains its local model by using its own training data. Then, via the wireless channels the weights or parameters of the locally trained models are sent to the central PS, that aggregates them and updates the global model. On one hand, FL plays an important role for optimizing the resources of wireless communication networks, on the other hand, wireless communications is crucial for FL. Thus, a `bidirectional' relationship exists between FL and wireless communications. Although FL is an emerging concept, many publications have already been published in the domain of FL and its applications for next generation wireless networks. Nevertheless, we noticed that none of the works have highlighted the bidirectional relationship between FL and wireless communications. Therefore, the purpose of this survey paper is to bridge this gap in literature by providing a timely and comprehensive discussion on the interdependency between FL and wireless communications.
    Wiki-CS: A Wikipedia-Based Benchmark for Graph Neural Networks. (arXiv:2007.02901v2 [cs.LG] UPDATED)
    (0 min) We present Wiki-CS, a novel dataset derived from Wikipedia for benchmarking Graph Neural Networks. The dataset consists of nodes corresponding to Computer Science articles, with edges based on hyperlinks and 10 classes representing different branches of the field. We use the dataset to evaluate semi-supervised node classification and single-relation link prediction models. Our experiments show that these methods perform well on a new domain, with structural properties different from earlier benchmarks. The dataset is publicly available, along with the implementation of the data pipeline and the benchmark experiments, at https://github.com/pmernyei/wiki-cs-dataset .
    Enabling Fast Deep Learning on Tiny Energy-Harvesting IoT Devices. (arXiv:2111.14051v2 [cs.LG] UPDATED)
    (0 min) Energy harvesting (EH) IoT devices that operate intermittently without batteries, coupled with advances in deep neural networks (DNNs), have opened up new opportunities for enabling sustainable smart applications. Nevertheless, implementing those computation and memory-intensive intelligent algorithms on EH devices is extremely difficult due to the challenges of limited resources and intermittent power supply that causes frequent failures. To address those challenges, this paper proposes a methodology that enables fast deep learning with low-energy accelerators for tiny energy harvesting devices. We first propose $RAD$, a resource-aware structured DNN training framework, which employs block circulant matrix and structured pruning to achieve high compression for leveraging the advantage of various vector operation accelerators. A DNN implementation method, $ACE$, is then proposed that employs low-energy accelerators to profit maximum performance with small energy consumption. Finally, we further design $FLEX$, the system support for intermittent computation in energy harvesting situations. Experimental results from three different DNN models demonstrate that $RAD$, $ACE$, and $FLEX$ can enable fast and correct inference on energy harvesting devices with up to 4.26X runtime reduction, up to 7.7X energy reduction with higher accuracy over the state-of-the-art.
    Tiny Adversarial Mulit-Objective Oneshot Neural Architecture Search. (arXiv:2103.00363v2 [cs.LG] UPDATED)
    (0 min) Due to limited computational cost and energy consumption, most neural network models deployed in mobile devices are tiny. However, tiny neural networks are commonly very vulnerable to attacks. Current research has proved that larger model size can improve robustness, but little research focuses on how to enhance the robustness of tiny neural networks. Our work focuses on how to improve the robustness of tiny neural networks without seriously deteriorating of clean accuracy under mobile-level resources. To this end, we propose a multi-objective oneshot network architecture search (NAS) algorithm to obtain the best trade-off networks in terms of the adversarial accuracy, the clean accuracy and the model size. Specifically, we design a novel search space based on new tiny blocks and channels to balance model size and adversarial performance. Moreover, since the supernet significantly affects the performance of subnets in our NAS algorithm, we reveal the insights into how the supernet helps to obtain the best subnet under white-box adversarial attacks. Concretely, we explore a new adversarial training paradigm by analyzing the adversarial transferability, the width of the supernet and the difference between training the subnets from scratch and fine-tuning. Finally, we make a statistical analysis for the layer-wise combination of certain blocks and channels on the first non-dominated front, which can serve as a guideline to design tiny neural network architectures for the resilience of adversarial perturbations.
    A Survey on Deep Learning and Explainability for Automatic Report Generation from Medical Images. (arXiv:2010.10563v2 [cs.CV] UPDATED)
    (0 min) Every year physicians face an increasing demand of image-based diagnosis from patients, a problem that can be addressed with recent artificial intelligence methods. In this context, we survey works in the area of automatic report generation from medical images, with emphasis on methods using deep neural networks, with respect to: (1) Datasets, (2) Architecture Design, (3) Explainability and (4) Evaluation Metrics. Our survey identifies interesting developments, but also remaining challenges. Among them, the current evaluation of generated reports is especially weak, since it mostly relies on traditional Natural Language Processing (NLP) metrics, which do not accurately capture medical correctness.
    Effective Sample Size, Dimensionality, and Generalization in Covariate Shift Adaptation. (arXiv:2010.01184v5 [stat.ML] UPDATED)
    (0 min) In supervised learning, training and test datasets are often sampled from distinct distributions. Domain adaptation techniques are thus required. Covariate shift adaptation yields good generalization performance when domains differ only by the marginal distribution of features. Covariate shift adaptation is usually implemented using importance weighting, which may fail, according to common wisdom, due to small effective sample sizes (ESS). Previous research argues this scenario is more common in high-dimensional settings. However, how effective sample size, dimensionality, and model performance/generalization are formally related in supervised learning, considering the context of covariate shift adaptation, is still somewhat obscure in the literature. Thus, a main challenge is presenting a unified theory connecting those points. Hence, in this paper, we focus on building a unified view connecting the ESS, data dimensionality, and generalization in the context of covariate shift adaptation. Moreover, we also demonstrate how dimensionality reduction or feature selection can increase the ESS, and argue that our results support dimensionality reduction before covariate shift adaptation as a good practice.
    COIN: Counterfactual Image Generation for VQA Interpretation. (arXiv:2201.03342v1 [cs.CV])
    (0 min) Due to the significant advancement of Natural Language Processing and Computer Vision-based models, Visual Question Answering (VQA) systems are becoming more intelligent and advanced. However, they are still error-prone when dealing with relatively complex questions. Therefore, it is important to understand the behaviour of the VQA models before adopting their results. In this paper, we introduce an interpretability approach for VQA models by generating counterfactual images. Specifically, the generated image is supposed to have the minimal possible change to the original image and leads the VQA model to give a different answer. In addition, our approach ensures that the generated image is realistic. Since quantitative metrics cannot be employed to evaluate the interpretability of the model, we carried out a user study to assess different aspects of our approach. In addition to interpreting the result of VQA models on single images, the obtained results and the discussion provides an extensive explanation of VQA models' behaviour.
    Addressing the Lack of Comparability & Testing in CAN Intrusion Detection Research: A Comprehensive Guide to CAN IDS Data & Introduction of the ROAD Dataset. (arXiv:2012.14600v2 [cs.CR] UPDATED)
    (0 min) Although ubiquitous in modern vehicles, Controller Area Networks (CANs) lack basic security properties and are easily exploitable. A rapidly growing field of CAN security research has emerged that seeks to detect intrusions on CANs. Producing vehicular CAN data with a variety of intrusions is out of reach for most researchers as it requires expensive assets and expertise. To assist researchers, we present the first comprehensive guide to the existing open CAN intrusion datasets, including a quality analysis of each dataset and an enumeration of each's benefits, drawbacks, and suggested use case. Current public CAN IDS datasets are limited to real fabrication (simple message injection) attacks and simulated attacks often in synthetic data, which lack fidelity. In general, the physical effects of attacks on the vehicle are not verified in the available datasets. Only one dataset provides signal-translated data but not a corresponding raw binary version. Overall, the available data pigeon-holes CAN IDS works into testing on limited, often inappropriate data (usually with attacks that are too easily detectable to truly test the method), and this lack data has stymied comparability and reproducibility of results. As our primary contribution, we present the ROAD (Real ORNL Automotive Dynamometer) CAN Intrusion Dataset, consisting of over 3.5 hours of one vehicle's CAN data. ROAD contains ambient data recorded during a diverse set of activities, and attacks of increasing stealth with multiple variants and instances of real fuzzing, fabrication, and unique advanced attacks, as well as simulated masquerade attacks. To facilitate benchmarking CAN IDS methods that require signal-translated inputs, we also provide the signal time series format for many of the CAN captures. Our contributions aim to facilitate appropriate benchmarking and needed comparability in the CAN IDS field.
    Project Achoo: A Practical Model and Application for COVID-19 Detection from Recordings of Breath, Voice, and Cough. (arXiv:2107.10716v2 [eess.SP] UPDATED)
    (0 min) The COVID-19 pandemic created a significant interest and demand for infection detection and monitoring solutions. In this paper we propose a machine learning method to quickly triage COVID-19 using recordings made on consumer devices. The approach combines signal processing methods with fine-tuned deep learning networks and provides methods for signal denoising, cough detection and classification. We have also developed and deployed a mobile application that uses symptoms checker together with voice, breath and cough signals to detect COVID-19 infection. The application showed robust performance on both open sourced datasets and on the noisy data collected during beta testing by the end users.
    BIGPrior: Towards Decoupling Learned Prior Hallucination and Data Fidelity in Image Restoration. (arXiv:2011.01406v3 [cs.CV] UPDATED)
    (0 min) Classic image-restoration algorithms use a variety of priors, either implicitly or explicitly. Their priors are hand-designed and their corresponding weights are heuristically assigned. Hence, deep learning methods often produce superior image restoration quality. Deep networks are, however, capable of inducing strong and hardly predictable hallucinations. Networks implicitly learn to be jointly faithful to the observed data while learning an image prior; and the separation of original data and hallucinated data downstream is then not possible. This limits their wide-spread adoption in image restoration. Furthermore, it is often the hallucinated part that is victim to degradation-model overfitting. We present an approach with decoupled network-prior based hallucination and data fidelity terms. We refer to our framework as the Bayesian Integration of a Generative Prior (BIGPrior). Our method is rooted in a Bayesian framework and tightly connected to classic restoration methods. In fact, it can be viewed as a generalization of a large family of classic restoration algorithms. We use network inversion to extract image prior information from a generative network. We show that, on image colorization, inpainting and denoising, our framework consistently improves the inversion results. Our method, though partly reliant on the quality of the generative network inversion, is competitive with state-of-the-art supervised and task-specific restoration methods. It also provides an additional metric that sets forth the degree of prior reliance per pixel relative to data fidelity.
    Applications of Deep Neural Networks. (arXiv:2009.05673v4 [cs.LG] UPDATED)
    (0 min) Deep learning is a group of exciting new technologies for neural networks. Through a combination of advanced training techniques and neural network architectural components, it is now possible to create neural networks that can handle tabular data, images, text, and audio as both input and output. Deep learning allows a neural network to learn hierarchies of information in a way that is like the function of the human brain. This course will introduce the student to classic neural network structures, Convolution Neural Networks (CNN), Long Short-Term Memory (LSTM), Gated Recurrent Neural Networks (GRU), General Adversarial Networks (GAN), and reinforcement learning. Application of these architectures to computer vision, time series, security, natural language processing (NLP), and data generation will be covered. High-Performance Computing (HPC) aspects will demonstrate how deep learning can be leveraged both on graphical processing units (GPUs), as well as grids. Focus is primarily upon the application of deep learning to problems, with some introduction to mathematical foundations. Readers will use the Python programming language to implement deep learning using Google TensorFlow and Keras. It is not necessary to know Python prior to this book; however, familiarity with at least one programming language is assumed.
    Small Object Detection using Deep Learning. (arXiv:2201.03243v1 [cs.CV])
    (0 min) Now a days, UAVs such as drones are greatly used for various purposes like that of capturing and target detection from ariel imagery etc. Easy access of these small ariel vehicles to public can cause serious security threats. For instance, critical places may be monitored by spies blended in public using drones. Study in hand proposes an improved and efficient Deep Learning based autonomous system which can detect and track very small drones with great precision. The proposed system consists of a custom deep learning model Tiny YOLOv3, one of the flavors of very fast object detection model You Look Only Once (YOLO) is built and used for detection. The object detection algorithm will efficiently the detect the drones. The proposed architecture has shown significantly better performance as compared to the previous YOLO version. The improvement is observed in the terms of resource usage and time complexity. The performance is measured using the metrics of recall and precision that are 93% and 91% respectively.
    Autonomous Kinetic Modeling of Biomass Pyrolysis using Chemical Reaction Neural Networks. (arXiv:2105.11397v2 [physics.chem-ph] UPDATED)
    (0 min) Modeling the burning processes of biomass such as wood, grass, and crops is crucial for the modeling and prediction of wildland and urban fire behavior. Despite its importance, the burning of solid fuels remains poorly understood, which can be partly attributed to the unknown chemical kinetics of most solid fuels. Most available kinetic models were built upon expert knowledge, which requires chemical insights and years of experience. This work presents a framework for autonomously discovering biomass pyrolysis kinetic models from thermogravimetric analyzer (TGA) experimental data using the recently developed chemical reaction neural networks (CRNN). The approach incorporated the CRNN model into the framework of neural ordinary differential equations to predict the residual mass in TGA data. In addition to the flexibility of neural-network-based models, the learned CRNN model is interpretable, by incorporating the fundamental physics laws, such as the law of mass action and Arrhenius law, into the neural network structure. The learned CRNN model can then be translated into the classical forms of biomass chemical kinetic models, which facilitates the extraction of chemical insights and the integration of the kinetic model into large-scale fire simulations. We demonstrated the effectiveness of the framework in predicting the pyrolysis and oxidation of cellulose. This successful demonstration opens the possibility of rapid and autonomous chemical kinetic modeling of solid fuels, such as wildfire fuels and industrial polymers.
    Discretization and Re-synthesis: an alternative method to solve the Cocktail Party Problem. (arXiv:2112.09382v2 [cs.SD] UPDATED)
    (0 min) Deep learning based models have significantly improved the performance of speech separation with input mixtures like the cocktail party. Prominent methods (e.g., frequency-domain and time-domain speech separation) usually build regression models to predict the ground-truth speech from the mixture, using the masking-based design and the signal-level loss criterion (e.g., MSE or SI-SNR). This study demonstrates, for the first time, that the synthesis-based approach can also perform well on this problem, with great flexibility and strong potential. Specifically, we propose a novel speech separation/enhancement model based on the recognition of discrete symbols, and convert the paradigm of the speech separation/enhancement related tasks from regression to classification. By utilizing the synthesis model with the input of discrete symbols, after the prediction of discrete symbol sequence, each target speech could be re-synthesized. Evaluation results based on the WSJ0-2mix and VCTK-noisy corpora in various settings show that our proposed method can steadily synthesize the separated speech with high speech quality and without any interference, which is difficult to avoid in regression-based methods. In addition, with negligible loss of listening quality, the speaker conversion of enhanced/separated speech could be easily realized through our method.
    Wind Park Power Prediction: Attention-Based Graph Networks and Deep Learning to Capture Wake Losses. (arXiv:2201.03229v1 [cs.LG])
    (0 min) With the increased penetration of wind energy into the power grid, it has become increasingly important to be able to predict the expected power production for larger wind farms. Deep learning (DL) models can learn complex patterns in the data and have found wide success in predicting wake losses and expected power production. This paper proposes a modular framework for attention-based graph neural networks (GNN), where attention can be applied to any desired component of a graph block. The results show that the model significantly outperforms a multilayer perceptron (MLP) and a bidirectional LSTM (BLSTM) model, while delivering performance on-par with a vanilla GNN model. Moreover, we argue that the proposed graph attention architecture can easily adapt to different applications by offering flexibility into the desired attention operations to be used, which might depend on the specific application. Through analysis of the attention weights, it was showed that employing attention-based GNNs can provide insights into what the models learn. In particular, the attention networks seemed to realise turbine dependencies that aligned with some physical intuition about wake losses.
    Federated Deep Learning in Electricity Forecasting: An MCDM Approach. (arXiv:2111.13834v3 [math.OC] UPDATED)
    (0 min) Large-scale data analysis is growing at an exponential rate as data proliferates in our societies. This abundance of data has the advantage of allowing the decision-maker to implement complex models in scenarios that were prohibitive before. At the same time, such an amount of data requires a distributed thinking approach. In fact, Deep Learning models require plenty of resources, and distributed training is needed. This paper presents a Multicriteria approach for distributed learning. Our approach uses the Weighted Goal Programming approach in its Chebyshev formulation to build an ensemble of decision rules that optimize aprioristically defined performance metrics. Such a formulation is beneficial because it is both model and metric agnostic and provides an interpretable output for the decision-maker. We test our approach by showing a practical application in electricity demand forecasting. Our results suggest that when we allow for dataset split overlapping, the performances of our methodology are consistently above the baseline model trained on the whole dataset.
    Towards Intrinsic Interactive Reinforcement Learning. (arXiv:2112.01575v2 [cs.AI] UPDATED)
    (0 min) Reinforcement learning (RL) and brain-computer interfaces (BCI) are two fields that have been growing over the past decade. Until recently, these fields have operated independently of one another. With the rising interest in human-in-the-loop (HITL) applications, RL algorithms have been adapted to account for human guidance giving rise to the sub-field of interactive reinforcement learning (IRL). Adjacently, BCI applications have been long interested in extracting intrinsic feedback from neural activity during human-computer interactions. These two ideas have set RL and BCI on a collision course for one another through the integration of BCI into the IRL framework where intrinsic feedback can be utilized to help train an agent. This intersection has created a new and emerging paradigm denoted as intrinsic IRL. To further help facilitate deeper ingratiation of BCI and IRL, we provide a tutorial and review of intrinsic IRL so far with an emphasis on its parent field of feedback-driven IRL along with discussions concerning validity, challenges, and open problems.
    Invariance encoding in sliced-Wasserstein space for image classification with limited training data. (arXiv:2201.02980v1 [cs.CV])
    (0 min) Deep convolutional neural networks (CNNs) are broadly considered to be state-of-the-art generic end-to-end image classification systems. However, they are known to underperform when training data are limited and thus require data augmentation strategies that render the method computationally expensive and not always effective. Rather than using a data augmentation strategy to encode invariances as typically done in machine learning, here we propose to mathematically augment a nearest subspace classification model in sliced-Wasserstein space by exploiting certain mathematical properties of the Radon Cumulative Distribution Transform (R-CDT), a recently introduced image transform. We demonstrate that for a particular type of learning problem, our mathematical solution has advantages over data augmentation with deep CNNs in terms of classification accuracy and computational complexity, and is particularly effective under a limited training data setting. The method is simple, effective, computationally efficient, non-iterative, and requires no parameters to be tuned. Python code implementing our method is available at https://github.com/rohdelab/mathematical_augmentation. Our method is integrated as a part of the software package PyTransKit, which is available at https://github.com/rohdelab/PyTransKit.
    Development of a hybrid machine-learning and optimization tool for performance-based solar shading design. (arXiv:2201.03028v1 [cs.LG])
    (0 min) Solar shading design should be done for the desired Indoor Environmental Quality (IEQ) in the early design stages. This field can be very challenging and time-consuming also requires experts, sophisticated software, and a large amount of money. The primary purpose of this research is to design a simple tool to study various models of solar shadings and make decisions easier and faster in the early stages. Database generation methods, artificial intelligence, and optimization have been used to achieve this goal. This tool includes two main parts of 1. predicting the performance of the user-selected model along with proposing effective parameters and 2. proposing optimal pre-prepared models to the user. In this regard, initially, a side-lit shoebox model with variable parameters was modeled parametrically, and five common solar shading models with their variables were applied to the space. For each solar shadings and the state without shading, metrics related to daylight and glare, view, and initial costs were simulated. The database generated in this research includes 87912 alternatives and six calculated metrics introduced to optimized machine learning models, including neural network, random Forrest, support vector regression, and k nearest neighbor. According to the results, the most accurate and fastest estimation model was Random Forrest, with an r2_score of 0.967 to 1. Then, sensitivity analysis was performed to identify the most influential parameters for each shading model and the state without it. This analysis distinguished the most effective parameters, including window orientation, WWR, room width, length, and shading depth. Finally, by optimizing the estimation function of machine learning models with the NSGA II algorithm, about 7300 optimal models were identified. The developed tool can evaluate various design alternatives in less than a few seconds for each.
    Task2Sim : Towards Effective Pre-training and Transfer from Synthetic Data. (arXiv:2112.00054v2 [cs.CV] UPDATED)
    (0 min) Pre-training models on Imagenet or other massive datasets of real images has led to major advances in computer vision, albeit accompanied with shortcomings related to curation cost, privacy, usage rights, and ethical issues. In this paper, for the first time, we study the transferability of pre-trained models based on synthetic data generated by graphics simulators to downstream tasks from very different domains. In using such synthetic data for pre-training, we find that downstream performance on different tasks are favored by different configurations of simulation parameters (e.g. lighting, object pose, backgrounds, etc.), and that there is no one-size-fits-all solution. It is thus better to tailor synthetic pre-training data to a specific downstream task, for best performance. We introduce Task2Sim, a unified model mapping downstream task representations to optimal simulation parameters to generate synthetic pre-training data for them. Task2Sim learns this mapping by training to find the set of best parameters on a set of "seen" tasks. Once trained, it can then be used to predict best simulation parameters for novel "unseen" tasks in one shot, without requiring additional training. Given a budget in number of images per class, our extensive experiments with 20 diverse downstream tasks show Task2Sim's task-adaptive pre-training data results in significantly better downstream performance than non-adaptively choosing simulation parameters on both seen and unseen tasks. It is even competitive with pre-training on real images from Imagenet.
    Lung infection and normal region segmentation from CT volumes of COVID-19 cases. (arXiv:2201.03050v1 [eess.IV])
    (0 min) This paper proposes an automated segmentation method of infection and normal regions in the lung from CT volumes of COVID-19 patients. From December 2019, novel coronavirus disease 2019 (COVID-19) spreads over the world and giving significant impacts to our economic activities and daily lives. To diagnose the large number of infected patients, diagnosis assistance by computers is needed. Chest CT is effective for diagnosis of viral pneumonia including COVID-19. A quantitative analysis method of condition of the lung from CT volumes by computers is required for diagnosis assistance of COVID-19. This paper proposes an automated segmentation method of infection and normal regions in the lung from CT volumes using a COVID-19 segmentation fully convolutional network (FCN). In diagnosis of lung diseases including COVID-19, analysis of conditions of normal and infection regions in the lung is important. Our method recognizes and segments lung normal and infection regions in CT volumes. To segment infection regions that have various shapes and sizes, we introduced dense pooling connections and dilated convolutions in our FCN. We applied the proposed method to CT volumes of COVID-19 cases. From mild to severe cases of COVID-19, the proposed method correctly segmented normal and infection regions in the lung. Dice scores of normal and infection regions were 0.911 and 0.753, respectively.
    The Logic of Graph Neural Networks. (arXiv:2104.14624v2 [cs.LG] UPDATED)
    (0 min) Graph neural networks (GNNs) are deep learning architectures for machine learning problems on graphs. It has recently been shown that the expressiveness of GNNs can be characterised precisely by the combinatorial Weisfeiler-Leman algorithms and by finite variable counting logics. The correspondence has even led to new, higher-order GNNs corresponding to the WL algorithm in higher dimensions. The purpose of this paper is to explain these descriptive characterisations of GNNs.
    Differentiable and Scalable Generative Adversarial Models for Data Imputation. (arXiv:2201.03202v1 [cs.LG])
    (0 min) Data imputation has been extensively explored to solve the missing data problem. The dramatically increasing volume of incomplete data makes the imputation models computationally infeasible in many real-life applications. In this paper, we propose an effective scalable imputation system named SCIS to significantly speed up the training of the differentiable generative adversarial imputation models under accuracy-guarantees for large-scale incomplete data. SCIS consists of two modules, differentiable imputation modeling (DIM) and sample size estimation (SSE). DIM leverages a new masking Sinkhorn divergence function to make an arbitrary generative adversarial imputation model differentiable, while for such a differentiable imputation model, SSE can estimate an appropriate sample size to ensure the user-specified imputation accuracy of the final model. Extensive experiments upon several real-life large-scale datasets demonstrate that, our proposed system can accelerate the generative adversarial model training by 7.1x. Using around 7.6% samples, SCIS yields competitive accuracy with the state-of-the-art imputation methods in a much shorter computation time.
    A Study on Mitigating Hard Boundaries of Decision-Tree-based Uncertainty Estimates for AI Models. (arXiv:2201.03263v1 [cs.LG])
    (0 min) Outcomes of data-driven AI models cannot be assumed to be always correct. To estimate the uncertainty in these outcomes, the uncertainty wrapper framework has been proposed, which considers uncertainties related to model fit, input quality, and scope compliance. Uncertainty wrappers use a decision tree approach to cluster input quality related uncertainties, assigning inputs strictly to distinct uncertainty clusters. Hence, a slight variation in only one feature may lead to a cluster assignment with a significantly different uncertainty. Our objective is to replace this with an approach that mitigates hard decision boundaries of these assignments while preserving interpretability, runtime complexity, and prediction performance. Five approaches were selected as candidates and integrated into the uncertainty wrapper framework. For the evaluation based on the Brier score, datasets for a pedestrian detection use case were generated using the CARLA simulator and YOLOv3. All integrated approaches achieved a softening, i.e., smoothing, of uncertainty estimation. Yet, compared to decision trees, they are not so easy to interpret and have higher runtime complexity. Moreover, some components of the Brier score impaired while others improved. Most promising regarding the Brier score were random forests. In conclusion, softening hard decision tree boundaries appears to be a trade-off decision.
    AutoDEUQ: Automated Deep Ensemble with Uncertainty Quantification. (arXiv:2110.13511v2 [cs.LG] UPDATED)
    (0 min) Deep neural networks are powerful predictors for a variety of tasks. However, they do not capture uncertainty directly. Using neural network ensembles to quantify uncertainty is competitive with approaches based on Bayesian neural networks while benefiting from better computational scalability. However, building ensembles of neural networks is a challenging task because, in addition to choosing the right neural architecture or hyperparameters for each member of the ensemble, there is an added cost of training each model. We propose AutoDEUQ, an automated approach for generating an ensemble of deep neural networks. Our approach leverages joint neural architecture and hyperparameter search to generate ensembles. We use the law of total variance to decompose the predictive variance of deep ensembles into aleatoric (data) and epistemic (model) uncertainties. We show that AutoDEUQ outperforms probabilistic backpropagation, Monte Carlo dropout, deep ensemble, distribution-free ensembles, and hyper ensemble methods on a number of regression benchmarks.
    Non-Asymptotic Guarantees for Robust Statistical Learning under $(1+\varepsilon)$-th Moment Assumption. (arXiv:2201.03182v1 [stat.ML])
    (0 min) There has been a surge of interest in developing robust estimators for models with heavy-tailed data in statistics and machine learning. This paper proposes a log-truncated M-estimator for a large family of statistical regressions and establishes its excess risk bound under the condition that the data have $(1+\varepsilon)$-th moment with $\varepsilon \in (0,1]$. With an additional assumption on the associated risk function, we obtain an $\ell_2$-error bound for the estimation. Our theorems are applied to establish robust M-estimators for concrete regressions. Besides convex regressions such as quantile regression and generalized linear models, many non-convex regressions can also be fit into our theorems, we focus on robust deep neural network regressions, which can be solved by the stochastic gradient descent algorithms. Simulations and real data analysis demonstrate the superiority of log-truncated estimations over standard estimations.
    Loss-calibrated expectation propagation for approximate Bayesian decision-making. (arXiv:2201.03128v1 [stat.ML])
    (0 min) Approximate Bayesian inference methods provide a powerful suite of tools for finding approximations to intractable posterior distributions. However, machine learning applications typically involve selecting actions, which -- in a Bayesian setting -- depend on the posterior distribution only via its contribution to expected utility. A growing body of work on loss-calibrated approximate inference methods has therefore sought to develop posterior approximations sensitive to the influence of the utility function. Here we introduce loss-calibrated expectation propagation (Loss-EP), a loss-calibrated variant of expectation propagation. This method resembles standard EP with an additional factor that "tilts" the posterior towards higher-utility decisions. We show applications to Gaussian process classification under binary utility functions with asymmetric penalties on False Negative and False Positive errors, and show how this asymmetry can have dramatic consequences on what information is "useful" to capture in an approximation.
    Adversarial GLUE: A Multi-Task Benchmark for Robustness Evaluation of Language Models. (arXiv:2111.02840v2 [cs.CL] UPDATED)
    (0 min) Large-scale pre-trained language models have achieved tremendous success across a wide range of natural language understanding (NLU) tasks, even surpassing human performance. However, recent studies reveal that the robustness of these models can be challenged by carefully crafted textual adversarial examples. While several individual datasets have been proposed to evaluate model robustness, a principled and comprehensive benchmark is still missing. In this paper, we present Adversarial GLUE (AdvGLUE), a new multi-task benchmark to quantitatively and thoroughly explore and evaluate the vulnerabilities of modern large-scale language models under various types of adversarial attacks. In particular, we systematically apply 14 textual adversarial attack methods to GLUE tasks to construct AdvGLUE, which is further validated by humans for reliable annotations. Our findings are summarized as follows. (i) Most existing adversarial attack algorithms are prone to generating invalid or ambiguous adversarial examples, with around 90% of them either changing the original semantic meanings or misleading human annotators as well. Therefore, we perform a careful filtering process to curate a high-quality benchmark. (ii) All the language models and robust training methods we tested perform poorly on AdvGLUE, with scores lagging far behind the benign accuracy. We hope our work will motivate the development of new adversarial attacks that are more stealthy and semantic-preserving, as well as new robust language models against sophisticated adversarial attacks. AdvGLUE is available at https://adversarialglue.github.io.
    Action Transformer: A Self-Attention Model for Short-Time Pose-Based Human Action Recognition. (arXiv:2107.00606v6 [cs.CV] UPDATED)
    (0 min) Deep neural networks based purely on attention have been successful across several domains, relying on minimal architectural priors from the designer. In Human Action Recognition (HAR), attention mechanisms have been primarily adopted on top of standard convolutional or recurrent layers, improving the overall generalization capability. In this work, we introduce Action Transformer (AcT), a simple, fully self-attentional architecture that consistently outperforms more elaborated networks that mix convolutional, recurrent and attentive layers. In order to limit computational and energy requests, building on previous human action recognition research, the proposed approach exploits 2D pose representations over small temporal windows, providing a low latency solution for accurate and effective real-time performance. Moreover, we open-source MPOSE2021, a new large-scale dataset, as an attempt to build a formal training and evaluation benchmark for real-time, short-time HAR. The proposed methodology was extensively tested on MPOSE2021 and compared to several state-of-the-art architectures, proving the effectiveness of the AcT model and laying the foundations for future work on HAR.
    Improving Fairness for Data Valuation in Federated Learning. (arXiv:2109.09046v2 [cs.LG] UPDATED)
    (0 min) Federated learning is an emerging decentralized machine learning scheme that allows multiple data owners to work collaboratively while ensuring data privacy. The success of federated learning depends largely on the participation of data owners. To sustain and encourage data owners' participation, it is crucial to fairly evaluate the quality of the data provided by the data owners and reward them correspondingly. Federated Shapley value, recently proposed by Wang et al. [Federated Learning, 2020], is a measure for data value under the framework of federated learning that satisfies many desired properties for data valuation. However, there are still factors of potential unfairness in the design of federated Shapley value because two data owners with the same local data may not receive the same evaluation. We propose a new measure called completed federated Shapley value to improve the fairness of federated Shapley value. The design depends on completing a matrix consisting of all the possible contributions by different subsets of the data owners. It is shown under mild conditions that this matrix is approximately low-rank by leveraging concepts and tools from optimization. Both theoretical analysis and empirical evaluation verify that the proposed measure does improve fairness in many circumstances.
    A Survey on Face Recognition Systems. (arXiv:2201.02991v1 [cs.CV])
    (0 min) Face Recognition has proven to be one of the most successful technology and has impacted heterogeneous domains. Deep learning has proven to be the most successful at computer vision tasks because of its convolution-based architecture. Since the advent of deep learning, face recognition technology has had a substantial increase in its accuracy. In this paper, some of the most impactful face recognition systems were surveyed. Firstly, the paper gives an overview of a general face recognition system. Secondly, the survey covers various network architectures and training losses that have had a substantial impact. Finally, the paper talks about various databases that are used to evaluate the capabilities of a face recognition system.
    Video-based fully automatic assessment of open surgery suturing skills. (arXiv:2110.13972v2 [cs.CV] UPDATED)
    (0 min) The goal of this study was to develop new reliable open surgery suturing simulation system for training medical students in situation where resources are limited or in the domestic setup. Namely, we developed an algorithm for tools and hands localization as well as identifying the interactions between them based on simple webcam video data, calculating motion metrics for assessment of surgical skill. Twenty-five participants performed multiple suturing tasks using our simulator. The YOLO network has been modified to a multi-task network, for the purpose of tool localization and tool-hand interaction detection. This was accomplished by splitting the YOLO detection heads so that they supported both tasks with minimal addition to computer run-time. Furthermore, based on the outcome of the system, motion metrics were calculated. These metrics included traditional metrics such as time and path length as well as new metrics assessing the technique participants use for holding the tools. The dual-task network performance was similar to that of two networks, while computational load was only slightly bigger than one network. In addition, the motion metrics showed significant differences between experts and novices. While video capture is an essential part of minimally invasive surgery, it is not an integral component of open surgery. Thus, new algorithms, focusing on the unique challenges open surgery videos present, are required. In this study, a dual-task network was developed to solve both a localization task and a hand-tool interaction task. The dual network may be easily expanded to a multi-task network, which may be useful for images with multiple layers and for evaluating the interaction between these different layers.
    Preserving Domain Private Representation via Mutual Information Maximization. (arXiv:2201.03102v1 [cs.LG])
    (0 min) Recent advances in unsupervised domain adaptation have shown that mitigating the domain divergence by extracting the domain-invariant representation could significantly improve the generalization of a model to an unlabeled data domain. Nevertheless, the existing methods fail to effectively preserve the representation that is private to the label-missing domain, which could adversely affect the generalization. In this paper, we propose an approach to preserve such representation so that the latent distribution of the unlabeled domain could represent both the domain-invariant features and the individual characteristics that are private to the unlabeled domain. In particular, we demonstrate that maximizing the mutual information between the unlabeled domain and its latent space while mitigating the domain divergence can achieve such preservation. We also theoretically and empirically validate that preserving the representation that is private to the unlabeled domain is important and of necessity for the cross-domain generalization. Our approach outperforms state-of-the-art methods on several public datasets.
    Predictions of Reynolds and Nusselt numbers in turbulent convection using machine-learning models. (arXiv:2201.03200v1 [physics.flu-dyn])
    (0 min) In this paper, we develop a multivariate regression model and a neural network model to predict the Reynolds number (Re) and Nusselt number in turbulent thermal convection. We compare their predictions with those of earlier models of convection: Grossmann-Lohse~[Phys. Rev. Lett. \textbf{86}, 3316 (2001)], revised Grossmann-Lohse~[Phys. Fluids \textbf{33}, 015113 (2021)], and Pandey-Verma [Phys. Rev. E \textbf{94}, 053106 (2016)] models. We observe that although the predictions of all the models are quite close to each other, the machine learning models developed in this work provide the best match with the experimental and numerical results.
    Privacy-aware Early Detection of COVID-19 through Adversarial Training. (arXiv:2201.03004v1 [cs.LG])
    (0 min) Early detection of COVID-19 is an ongoing area of research that can help with triage, monitoring and general health assessment of potential patients and may reduce operational strain on hospitals that cope with the coronavirus pandemic. Different machine learning techniques have been used in the literature to detect coronavirus using routine clinical data (blood tests, and vital signs). Data breaches and information leakage when using these models can bring reputational damage and cause legal issues for hospitals. In spite of this, protecting healthcare models against leakage of potentially sensitive information is an understudied research area. In this work, we examine two machine learning approaches, intended to predict a patient's COVID-19 status using routinely collected and readily available clinical data. We employ adversarial training to explore robust deep learning architectures that protect attributes related to demographic information about the patients. The two models we examine in this work are intended to preserve sensitive information against adversarial attacks and information leakage. In a series of experiments using datasets from the Oxford University Hospitals, Bedfordshire Hospitals NHS Foundation Trust, University Hospitals Birmingham NHS Foundation Trust, and Portsmouth Hospitals University NHS Trust we train and test two neural networks that predict PCR test results using information from basic laboratory blood tests, and vital signs performed on a patients' arrival to hospital. We assess the level of privacy each one of the models can provide and show the efficacy and robustness of our proposed architectures against a comparable baseline. One of our main contributions is that we specifically target the development of effective COVID-19 detection models with built-in mechanisms in order to selectively protect sensitive attributes against adversarial attacks.
    Reframing Neural Networks: Deep Structure in Overcomplete Representations. (arXiv:2103.05804v2 [cs.LG] UPDATED)
    (0 min) In comparison to classical shallow representation learning techniques, deep neural networks have achieved superior performance in nearly every application benchmark. But despite their clear empirical advantages, it is still not well understood what makes them so effective. To approach this question, we introduce deep frame approximation: a unifying framework for constrained representation learning with structured overcomplete frames. While exact inference requires iterative optimization, it may be approximated by the operations of a feed-forward deep neural network. We indirectly analyze how model capacity relates to frame structures induced by architectural hyperparameters such as depth, width, and skip connections. We quantify these structural differences with the deep frame potential, a data-independent measure of coherence linked to representation uniqueness and stability. As a criterion for model selection, we show correlation with generalization error on a variety of common deep network architectures and datasets. We also demonstrate how recurrent networks implementing iterative optimization algorithms can achieve performance comparable to their feed-forward approximations while improving adversarial robustness. This connection to the established theory of overcomplete representations suggests promising new directions for principled deep network architecture design with less reliance on ad-hoc engineering.
    Learning Meta Representations for Agents in Multi-Agent Reinforcement Learning. (arXiv:2108.12988v2 [cs.LG] UPDATED)
    (0 min) In multi-agent reinforcement learning, the behaviors that agents learn in a single Markov Game (MG) are typically confined to the given agent number (i.e., population size). Every single MG induced by varying population sizes may possess distinct optimal joint strategies and game-specific knowledge, which are modeled independently in modern multi-agent algorithms. In this work, we focus on creating agents that generalize across population-varying MGs. Instead of learning a unimodal policy, each agent learns a policy set that is formed by effective strategies across a variety of games. We propose Meta Representations for Agents (MRA) that explicitly models the game-common and game-specific strategic knowledge. By representing the policy sets with multi-modal latent policies, the common strategic knowledge and diverse strategic modes are discovered with an iterative optimization procedure. We prove that as an approximation to a constrained mutual information maximization objective, the learned policies can reach Nash Equilibrium in every evaluation MG under the assumption of Lipschitz game on a sufficiently large latent space. When deploying it at practical latent models with limited size, fast adaptation can be achieved by leveraging the first-order gradient information. Extensive experiments show the effectiveness of MRA on both training performance and generalization ability in hard and unseen games.
    Integration of Explainable Artificial Intelligence to Identify Significant Landslide Causal Factors for Extreme Gradient Boosting based Landslide Susceptibility Mapping with Improved Feature Selection. (arXiv:2201.03225v1 [cs.LG])
    (0 min) Landslides have been a regular occurrence and an alarming threat to human life and property in the era of anthropogenic global warming. An early prediction of landslide susceptibility using a data-driven approach is a demand of time. In this study, we explored the eloquent features that best describe landslide susceptibility with state-of-the-art machine learning methods. In our study, we employed state-of-the-art machine learning algorithms including XgBoost, LR, KNN, SVM, Adaboost for landslide susceptibility prediction. To find the best hyperparameters of each individual classifier for optimized performance, we have incorporated the Grid Search method, with 10 Fold Cross-Validation. In this context, the optimized version of XgBoost outperformed all other classifiers with a Cross-validation Weighted F1 score of 94.62%. Followed by this empirical evidence, we explored the XgBoost classifier by incorporating TreeSHAP and identified eloquent features such as SLOPE, ELEVATION, TWI that complement the performance of the XGBoost classifier mostly and features such as LANDUSE, NDVI, SPI which has less effect on models performance. According to the TreeSHAP explanation of features, we selected the 9 most significant landslide causal factors out of 15. Evidently, an optimized version of XgBoost along with feature reduction by 40%, has outperformed all other classifiers in terms of popular evaluation metrics with a Cross-Validation Weighted F1 score of 95.01% on the training and AUC score of 97%.
    Robust and Resource-Efficient Data-Free Knowledge Distillation by Generative Pseudo Replay. (arXiv:2201.03019v1 [cs.LG])
    (0 min) Data-Free Knowledge Distillation (KD) allows knowledge transfer from a trained neural network (teacher) to a more compact one (student) in the absence of original training data. Existing works use a validation set to monitor the accuracy of the student over real data and report the highest performance throughout the entire process. However, validation data may not be available at distillation time either, making it infeasible to record the student snapshot that achieved the peak accuracy. Therefore, a practical data-free KD method should be robust and ideally provide monotonically increasing student accuracy during distillation. This is challenging because the student experiences knowledge degradation due to the distribution shift of the synthetic data. A straightforward approach to overcome this issue is to store and rehearse the generated samples periodically, which increases the memory footprint and creates privacy concerns. We propose to model the distribution of the previously observed synthetic samples with a generative network. In particular, we design a Variational Autoencoder (VAE) with a training objective that is customized to learn the synthetic data representations optimally. The student is rehearsed by the generative pseudo replay technique, with samples produced by the VAE. Hence knowledge degradation can be prevented without storing any samples. Experiments on image classification benchmarks show that our method optimizes the expected value of the distilled model accuracy while eliminating the large memory overhead incurred by the sample-storing methods.
    HAWKS: Evolving Challenging Benchmark Sets for Cluster Analysis. (arXiv:2102.06940v2 [cs.NE] UPDATED)
    (0 min) Comprehensive benchmarking of clustering algorithms is rendered difficult by two key factors: (i)~the elusiveness of a unique mathematical definition of this unsupervised learning approach and (ii)~dependencies between the generating models or clustering criteria adopted by some clustering algorithms and indices for internal cluster validation. Consequently, there is no consensus regarding the best practice for rigorous benchmarking, and whether this is possible at all outside the context of a given application. Here, we argue that synthetic datasets must continue to play an important role in the evaluation of clustering algorithms, but that this necessitates constructing benchmarks that appropriately cover the diverse set of properties that impact clustering algorithm performance. Through our framework, HAWKS, we demonstrate the important role evolutionary algorithms play to support flexible generation of such benchmarks, allowing simple modification and extension. We illustrate two possible uses of our framework: (i)~the evolution of benchmark data consistent with a set of hand-derived properties and (ii)~the generation of datasets that tease out performance differences between a given pair of algorithms. Our work has implications for the design of clustering benchmarks that sufficiently challenge a broad range of algorithms, and for furthering insight into the strengths and weaknesses of specific approaches.
    Noisy Neonatal Chest Sound Separation for High-Quality Heart and Lung Sounds. (arXiv:2201.03211v1 [eess.AS])
    (0 min) Stethoscope-recorded chest sounds provide the opportunity for remote cardio-respiratory health monitoring of neonates. However, reliable monitoring requires high-quality heart and lung sounds. This paper presents novel Non-negative Matrix Factorisation (NMF) and Non-negative Matrix Co-Factorisation (NMCF) methods for neonatal chest sound separation. To assess these methods and compare with existing single-source separation methods, an artificial mixture dataset was generated comprising of heart, lung and noise sounds. Signal-to-noise ratios were then calculated for these artificial mixtures. These methods were also tested on real-world noisy neonatal chest sounds and assessed based on vital sign estimation error and a signal quality score of 1-5 developed in our previous works. Additionally, the computational cost of all methods was assessed to determine the applicability for real-time processing. Overall, both the proposed NMF and NMCF methods outperform the next best existing method by 2.7dB to 11.6dB for the artificial dataset and 0.40 to 1.12 signal quality improvement for the real-world dataset. The median processing time for the sound separation of a 10s recording was found to be 28.3s for NMCF and 342ms for NMF. Because of stable and robust performance, we believe that our proposed methods are useful to denoise neonatal heart and lung sound in a real-world environment. Codes for proposed and existing methods can be found at: https://github.com/egrooby-monash/Heart-and-Lung-Sound-Separation.
    Online Deterministic Annealing for Classification and Clustering. (arXiv:2102.05836v4 [cs.LG] UPDATED)
    (0 min) Inherent in virtually every iterative machine learning algorithm is the problem of hyper-parameter tuning, which includes three major design parameters: (a) the complexity of the model, e.g., the number of neurons in a neural network, (b) the initial conditions, which heavily affect the behavior of the algorithm, and (c) the dissimilarity measure used to quantify its performance. We introduce an online prototype-based learning algorithm that can be viewed as a progressively growing competitive-learning neural network architecture for classification and clustering. The learning rule of the proposed approach is formulated as an online gradient-free stochastic approximation algorithm that solves a sequence of appropriately defined optimization problems, simulating an annealing process. The annealing nature of the algorithm contributes to avoiding poor local minima, offers robustness with respect to the initial conditions, and provides a means to progressively increase the complexity of the learning model, through an intuitive bifurcation phenomenon. The proposed approach is interpretable, requires minimal hyper-parameter tuning, and allows online control over the performance-complexity trade-off. Finally, we show that Bregman divergences appear naturally as a family of dissimilarity measures that play a central role in both the performance and the computational complexity of the learning algorithm.
    Information-Theoretic Bias Reduction via Causal View of Spurious Correlation. (arXiv:2201.03121v1 [cs.LG])
    (0 min) We propose an information-theoretic bias measurement technique through a causal interpretation of spurious correlation, which is effective to identify the feature-level algorithmic bias by taking advantage of conditional mutual information. Although several bias measurement methods have been proposed and widely investigated to achieve algorithmic fairness in various tasks such as face recognition, their accuracy- or logit-based metrics are susceptible to leading to trivial prediction score adjustment rather than fundamental bias reduction. Hence, we design a novel debiasing framework against the algorithmic bias, which incorporates a bias regularization loss derived by the proposed information-theoretic bias measurement approach. In addition, we present a simple yet effective unsupervised debiasing technique based on stochastic label noise, which does not require the explicit supervision of bias information. The proposed bias measurement and debiasing approaches are validated in diverse realistic scenarios through extensive experiments on multiple standard benchmarks.
    Graph Representation Learning for Multi-Task Settings: a Meta-Learning Approach. (arXiv:2201.03326v1 [cs.LG])
    (0 min) Graph Neural Networks (GNNs) have become the state-of-the-art method for many applications on graph structured data. GNNs are a framework for graph representation learning, where a model learns to generate low dimensional node embeddings that encapsulate structural and feature-related information. GNNs are usually trained in an end-to-end fashion, leading to highly specialized node embeddings. While this approach achieves great results in the single-task setting, generating node embeddings that can be used to perform multiple tasks (with performance comparable to single-task models) is still an open problem. We propose a novel training strategy for graph representation learning, based on meta-learning, which allows the training of a GNN model capable of producing multi-task node embeddings. Our method avoids the difficulties arising when learning to perform multiple tasks concurrently by, instead, learning to quickly (i.e. with a few steps of gradient descent) adapt to multiple tasks singularly. We show that the embeddings produced by a model trained with our method can be used to perform multiple tasks with comparable or, surprisingly, even higher performance than both single-task and multi-task end-to-end models.
    Enhancing Haptic Distinguishability of Surface Materials with Boosting Technique. (arXiv:2010.02002v4 [cs.LG] UPDATED)
    (0 min) Discriminative features are crucial for several learning applications, such as object detection and classification. Neural networks are extensively used for extracting discriminative features of images and speech signals. However, the lack of large datasets in the haptics domain often limits the applicability of such techniques. This paper presents a general framework for the analysis of the discriminative properties of haptic signals. We demonstrate the effectiveness of spectral features and a boosted embedding technique in enhancing the distinguishability of haptic signals. Experiments indicate our framework needs less training data, generalizes well for different predictors, and outperforms the related state-of-the-art.
    Topographic VAEs learn Equivariant Capsules. (arXiv:2109.01394v2 [cs.LG] UPDATED)
    (0 min) In this work we seek to bridge the concepts of topographic organization and equivariance in neural networks. To accomplish this, we introduce the Topographic VAE: a novel method for efficiently training deep generative models with topographically organized latent variables. We show that such a model indeed learns to organize its activations according to salient characteristics such as digit class, width, and style on MNIST. Furthermore, through topographic organization over time (i.e. temporal coherence), we demonstrate how predefined latent space transformation operators can be encouraged for observed transformed input sequences -- a primitive form of unsupervised learned equivariance. We demonstrate that this model successfully learns sets of approximately equivariant features (i.e. "capsules") directly from sequences and achieves higher likelihood on correspondingly transforming test sequences. Equivariance is verified quantitatively by measuring the approximate commutativity of the inference network and the sequence transformations. Finally, we demonstrate approximate equivariance to complex transformations, expanding upon the capabilities of existing group equivariant neural networks.
    Understanding Layer-wise Contributions in Deep Neural Networks through Spectral Analysis. (arXiv:2111.03972v3 [cs.LG] UPDATED)
    (0 min) Spectral analysis is a powerful tool, decomposing any function into simpler parts. In machine learning, Mercer's theorem generalizes this idea, providing for any kernel and input distribution a natural basis of functions of increasing frequency. More recently, several works have extended this analysis to deep neural networks through the framework of Neural Tangent Kernel. In this work, we analyze the layer-wise spectral bias of Deep Neural Networks and relate it to the contributions of different layers in the reduction of generalization error for a given target function. We utilize the properties of Hermite polynomials and Spherical Harmonics to prove that initial layers exhibit a larger bias towards high-frequency functions defined on the unit sphere. We further provide empirical results validating our theory in high dimensional datasets for Deep Neural Networks.
    Modeling Historical AIS Data For Vessel Path Prediction: A Comprehensive Treatment. (arXiv:2001.01592v3 [cs.AI] UPDATED)
    (0 min) The prosperity of artificial intelligence has aroused intensive interests in intelligent/autonomous navigation, in which path prediction is a key functionality for decision supports, e.g. route planning, collision warning, and traffic regulation. For maritime intelligence, Automatic Identification System (AIS) plays an important role because it recently has been made compulsory for large international commercial vessels and is able to provide nearly real-time information of the vessel. Therefore AIS data based vessel path prediction is a promising way in future maritime intelligence. However, real-world AIS data collected online are just highly irregular trajectory segments (AIS message sequences) from different types of vessels and geographical regions, with possibly very low data quality. So even there are some works studying how to build a path prediction model using historical AIS data, but still, it is a very challenging problem. In this paper, we propose a comprehensive framework to model massive historical AIS trajectory segments for accurate vessel path prediction. Experimental comparisons with existing popular methods are made to validate the proposed approach and results show that our approach could outperform the baseline methods by a wide margin.
    $m^\ast$ of two-dimensional electron gas: a neural canonical transformation study. (arXiv:2201.03156v1 [cond-mat.stat-mech])
    (0 min) The quasiparticle effective mass $m^\ast$ of interacting electrons is a fundamental quantity in the Fermi liquid theory. However, the precise value of the effective mass of uniform electron gas is still elusive after decades of research. The newly developed neural canonical transformation approach arXiv:2105.08644 offers a principled way to extract the effective mass of electron gas by directly calculating the thermal entropy at low temperature. The approach models a variational many-electron density matrix using two generative neural networks: an autoregressive model for momentum occupation and a normalizing flow for electron coordinates. Our calculation reveals a suppression of effective mass in the two-dimensional spin-polarized electron gas, which is more pronounced than previous reports in the low-density strong-coupling region. This prediction calls for verification in two-dimensional electron gas experiments.
    GridTuner: Reinvestigate Grid Size Selection for Spatiotemporal Prediction Models [Technical Report]. (arXiv:2201.03244v1 [cs.DB])
    (0 min) With the development of traffic prediction technology, spatiotemporal prediction models have attracted more and more attention from academia communities and industry. However, most existing researches focus on reducing model's prediction error but ignore the error caused by the uneven distribution of spatial events within a region. In this paper, we study a region partitioning problem, namely optimal grid size selection problem (OGSS), which aims to minimize the real error of spatiotemporal prediction models by selecting the optimal grid size. In order to solve OGSS, we analyze the upper bound of real error of spatiotemporal prediction models and minimize the real error by minimizing its upper bound. Through in-depth analysis, we find that the upper bound of real error will decrease then increase when the number of model grids increase from 1 to the maximum allowed value. Then, we propose two algorithms, namely Ternary Search and Iterative Method, to automatically find the optimal grid size. Finally, the experiments verify that the error of prediction has the same trend as its upper bound, and the change trend of the upper bound of real error with respect to the increase of the number of model grids will decrease then increase. Meanwhile, in a case study, by selecting the optimal grid size, the order dispatching results of a state-of-the-art prediction-based algorithm can be improved up to 13.6%, which shows the effectiveness of our methods on tuning the region partition for spatiotemporal prediction models.
    Scaling Knowledge Graph Embedding Models. (arXiv:2201.02791v1 [cs.LG])
    (0 min) Developing scalable solutions for training Graph Neural Networks (GNNs) for link prediction tasks is challenging due to the high data dependencies which entail high computational cost and huge memory footprint. We propose a new method for scaling training of knowledge graph embedding models for link prediction to address these challenges. Towards this end, we propose the following algorithmic strategies: self-sufficient partitions, constraint-based negative sampling, and edge mini-batch training. Both, partitioning strategy and constraint-based negative sampling, avoid cross partition data transfer during training. In our experimental evaluation, we show that our scaling solution for GNN-based knowledge graph embedding models achieves a 16x speed up on benchmark datasets while maintaining a comparable model performance as non-distributed methods on standard metrics.
    DataLens: Scalable Privacy Preserving Training via Gradient Compression and Aggregation. (arXiv:2103.11109v5 [cs.LG] UPDATED)
    (0 min) Recent success of deep neural networks (DNNs) hinges on the availability of large-scale dataset; however, training on such dataset often poses privacy risks for sensitive training information. In this paper, we aim to explore the power of generative models and gradient sparsity, and propose a scalable privacy-preserving generative model DATALENS. Comparing with the standard PATE privacy-preserving framework which allows teachers to vote on one-dimensional predictions, voting on the high dimensional gradient vectors is challenging in terms of privacy preservation. As dimension reduction techniques are required, we need to navigate a delicate tradeoff space between (1) the improvement of privacy preservation and (2) the slowdown of SGD convergence. To tackle this, we take advantage of communication efficient learning and propose a novel noise compression and aggregation approach TOPAGG by combining top-k compression for dimension reduction with a corresponding noise injection mechanism. We theoretically prove that the DATALENS framework guarantees differential privacy for its generated data, and provide analysis on its convergence. To demonstrate the practical usage of DATALENS, we conduct extensive experiments on diverse datasets including MNIST, Fashion-MNIST, and high dimensional CelebA, and we show that, DATALENS significantly outperforms other baseline DP generative models. In addition, we adapt the proposed TOPAGG approach, which is one of the key building blocks in DATALENS, to DP SGD training, and show that it is able to achieve higher utility than the state-of-the-art DP SGD approach in most cases. Our code is publicly available at https://github.com/AI-secure/DataLens.
    Local Information Assisted Attention-free Decoder for Audio Captioning. (arXiv:2201.03217v1 [cs.SD])
    (0 min) Automated audio captioning (AAC) aims to describe audio data with captions using natural language. Most existing AAC methods adopt an encoder-decoder structure, where the attention based mechanism is a popular choice in the decoder (e.g., Transformer decoder) for predicting captions from audio features. Such attention based decoders can capture the global information from the audio features, however, their ability in extracting local information can be limited, which may lead to degraded quality in the generated captions. In this paper, we present an AAC method with an attention-free decoder, where an encoder based on PANNs is employed for audio feature extraction, and the attention-free decoder is designed to introduce local information. The proposed method enables the effective use of both global and local information from audio signals. Experiments show that our method outperforms the state-of-the-art methods with the standard attention based decoder in Task 6 of the DCASE 2021 Challenge.
    Opportunities of Hybrid Model-based Reinforcement Learning for Cell Therapy Manufacturing Process Development and Control. (arXiv:2201.03116v1 [eess.SY])
    (0 min) Driven by the key challenges of cell therapy manufacturing, including high complexity, high uncertainty, and very limited process data, we propose a stochastic optimization framework named "hybrid-RL" to efficiently guide process development and control. We first create the bioprocess probabilistic knowledge graph that is a hybrid model characterizing the understanding of biomanufacturing process mechanisms and quantifying inherent stochasticity, such as batch-to-batch variation and bioprocess noise. It can capture the key features, including nonlinear reactions, time-varying kinetics, and partially observed bioprocess state. This hybrid model can leverage on existing mechanistic models and facilitate the learning from process data. Given limited process data, a computational sampling approach is used to generate posterior samples quantifying the model estimation uncertainty. Then, we introduce hybrid model-based Bayesian reinforcement learning (RL), accounting for both inherent stochasticity and model uncertainty, to guide optimal, robust, and interpretable decision making, which can overcome the key challenges of cell therapy manufacturing. In the empirical study, cell therapy manufacturing examples are used to demonstrate that the proposed hybrid-RL framework can outperform the classical deterministic mechanistic model assisted process optimization.
    Benchmarks, Algorithms, and Metrics for Hierarchical Disentanglement. (arXiv:2102.05185v3 [cs.LG] UPDATED)
    (0 min) In representation learning, there has been recent interest in developing algorithms to disentangle the ground-truth generative factors behind a dataset, and metrics to quantify how fully this occurs. However, these algorithms and metrics often assume that both representations and ground-truth factors are flat, continuous, and factorized, whereas many real-world generative processes involve rich hierarchical structure, mixtures of discrete and continuous variables with dependence between them, and even varying intrinsic dimensionality. In this work, we develop benchmarks, algorithms, and metrics for learning such hierarchical representations.
    Collaborative Reflection-Augmented Autoencoder Network for Recommender Systems. (arXiv:2201.03158v1 [cs.IR])
    (0 min) As the deep learning techniques have expanded to real-world recommendation tasks, many deep neural network based Collaborative Filtering (CF) models have been developed to project user-item interactions into latent feature space, based on various neural architectures, such as multi-layer perceptron, auto-encoder and graph neural networks. However, the majority of existing collaborative filtering systems are not well designed to handle missing data. Particularly, in order to inject the negative signals in the training phase, these solutions largely rely on negative sampling from unobserved user-item interactions and simply treating them as negative instances, which brings the recommendation performance degradation. To address the issues, we develop a Collaborative Reflection-Augmented Autoencoder Network (CRANet), that is capable of exploring transferable knowledge from observed and unobserved user-item interactions. The network architecture of CRANet is formed of an integrative structure with a reflective receptor network and an information fusion autoencoder module, which endows our recommendation framework with the ability of encoding implicit user's pairwise preference on both interacted and non-interacted items. Additionally, a parametric regularization-based tied-weight scheme is designed to perform robust joint training of the two-stage CRANet model. We finally experimentally validate CRANet on four diverse benchmark datasets corresponding to two recommendation tasks, to show that debiasing the negative signals of user-item interactions improves the performance as compared to various state-of-the-art recommendation techniques. Our source code is available at https://github.com/akaxlh/CRANet.
    Near Optimality of Finite Memory Feedback Policies in Partially Observed Markov Decision Processes. (arXiv:2010.07452v2 [math.OC] UPDATED)
    (0 min) In the theory of Partially Observed Markov Decision Processes (POMDPs), existence of optimal policies have in general been established via converting the original partially observed stochastic control problem to a fully observed one on the belief space, leading to a belief-MDP. However, computing an optimal policy for this fully observed model, and so for the original POMDP, using classical dynamic or linear programming methods is challenging even if the original system has finite state and action spaces, since the state space of the fully observed belief-MDP model is always uncountable. Furthermore, there exist very few rigorous value function approximation and optimal policy approximation results, as regularity conditions needed often require a tedious study involving the spaces of probability measures leading to properties such as Feller continuity. In this paper, we study a planning problem for POMDPs where the system dynamics and measurement channel model are assumed to be known. We construct an approximate belief model by discretizing the belief space using only finite window information variables. We then find optimal policies for the approximate model and we rigorously establish near optimality of the constructed finite window control policies in POMDPs under mild non-linear filter stability conditions and the assumption that the measurement and action sets are finite (and the state space is real vector valued). We also establish a rate of convergence result which relates the finite window memory size and the approximation error bound, where the rate of convergence is exponential under explicit and testable exponential filter stability conditions. While there exist many experimental results and few rigorous asymptotic convergence results, an explicit rate of convergence result is new in the literature, to our knowledge.
    Distinguishing Natural and Computer-Generated Images using Multi-Colorspace fused EfficientNet. (arXiv:2110.09428v2 [cs.CV] UPDATED)
    (0 min) The problem of distinguishing natural images from photo-realistic computer-generated ones either addresses natural images versus computer graphics or natural images versus GAN images, at a time. But in a real-world image forensic scenario, it is highly essential to consider all categories of image generation, since in most cases image generation is unknown. We, for the first time, to our best knowledge, approach the problem of distinguishing natural images from photo-realistic computer-generated images as a three-class classification task classifying natural, computer graphics, and GAN images. For the task, we propose a Multi-Colorspace fused EfficientNet model by parallelly fusing three EfficientNet networks that follow transfer learning methodology where each network operates in different colorspaces, RGB, LCH, and HSV, chosen after analyzing the efficacy of various colorspace transformations in this image forensics problem. Our model outperforms the baselines in terms of accuracy, robustness towards post-processing, and generalizability towards other datasets. We conduct psychophysics experiments to understand how accurately humans can distinguish natural, computer graphics, and GAN images where we could observe that humans find difficulty in classifying these images, particularly the computer-generated images, indicating the necessity of computational algorithms for the task. We also analyze the behavior of our model through visual explanations to understand salient regions that contribute to the model's decision making and compare with manual explanations provided by human participants in the form of region markings, where we could observe similarities in both the explanations indicating the powerful nature of our model to take the decisions meaningfully.
    Video-Specific Autoencoders for Exploring, Editing and Transmitting Videos. (arXiv:2103.17261v2 [cs.CV] UPDATED)
    (0 min) We study video-specific autoencoders that allow a human user to explore, edit, and efficiently transmit videos. Prior work has independently looked at these problems (and sub-problems) and proposed different formulations. In this work, we train a simple autoencoder (from scratch) on multiple frames of a specific video. We observe: (1) latent codes learned by a video-specific autoencoder capture spatial and temporal properties of that video; and (2) autoencoders can project out-of-sample inputs onto the video-specific manifold. These two properties allow us to explore, edit, and efficiently transmit a video using one learned representation. For e.g., linear operations on latent codes allow users to visualize the contents of a video. Associating latent codes of a video and manifold projection enables users to make desired edits. Interpolating latent codes and manifold projection allows the transmission of sparse low-res frames over a network.
    Trade-offs between membership privacy & adversarially robust learning. (arXiv:2006.04622v2 [cs.LG] UPDATED)
    (0 min) Historically, machine learning methods have not been designed with security in mind. In turn, this has given rise to adversarial examples, carefully perturbed input samples aimed to mislead detection at test time, which have been applied to attack spam and malware classification, and more recently to attack image classification. Consequently, an abundance of research has been devoted to designing machine learning methods that are robust to adversarial examples. Unfortunately, there are desiderata besides robustness that a secure and safe machine learning model must satisfy, such as fairness and privacy. Recent work by Song et al. (2019) has shown, empirically, that there exists a trade-off between robust and private machine learning models. Models designed to be robust to adversarial examples often overfit on training data to a larger extent than standard (non-robust) models. If a dataset contains private information, then any statistical test that separates training and test data by observing a model's outputs can represent a privacy breach, and if a model overfits on training data, these statistical tests become easier. In this work, we identify settings where standard models will overfit to a larger extent in comparison to robust models, and as empirically observed in previous works, settings where the opposite behavior occurs. Thus, it is not necessarily the case that privacy must be sacrificed to achieve robustness. The degree of overfitting naturally depends on the amount of data available for training. We go on to characterize how the training set size factors into the privacy risks exposed by training a robust model on a simple Gaussian data task, and show empirically that our findings hold on image classification benchmark datasets, such as CIFAR-10 and CIFAR-100.
    Avoiding Overfitting: A Survey on Regularization Methods for Convolutional Neural Networks. (arXiv:2201.03299v1 [cs.CV])
    (0 min) Several image processing tasks, such as image classification and object detection, have been significantly improved using Convolutional Neural Networks (CNN). Like ResNet and EfficientNet, many architectures have achieved outstanding results in at least one dataset by the time of their creation. A critical factor in training concerns the network's regularization, which prevents the structure from overfitting. This work analyzes several regularization methods developed in the last few years, showing significant improvements for different CNN models. The works are classified into three main areas: the first one is called "data augmentation", where all the techniques focus on performing changes in the input data. The second, named "internal changes", which aims to describe procedures to modify the feature maps generated by the neural network or the kernels. The last one, called "label", concerns transforming the labels of a given input. This work presents two main differences comparing to other available surveys about regularization: (i) the first concerns the papers gathered in the manuscript, which are not older than five years, and (ii) the second distinction is about reproducibility, i.e., all works refered here have their code available in public repositories or they have been directly implemented in some framework, such as TensorFlow or Torch.
    Applying Artificial Intelligence for Age Estimation in Digital Forensic Investigations. (arXiv:2201.03045v1 [cs.CV])
    (0 min) The precise age estimation of child sexual abuse and exploitation (CSAE) victims is one of the most significant digital forensic challenges. Investigators often need to determine the age of victims by looking at images and interpreting the sexual development stages and other human characteristics. The main priority - safeguarding children -- is often negatively impacted by a huge forensic backlog, cognitive bias and the immense psychological stress that this work can entail. This paper evaluates existing facial image datasets and proposes a new dataset tailored to the needs of similar digital forensic research contributions. This small, diverse dataset of 0 to 20-year-old individuals contains 245 images and is merged with 82 unique images from the FG-NET dataset, thus achieving a total of 327 images with high image diversity and low age range density. The new dataset is tested on the Deep EXpectation (DEX) algorithm pre-trained on the IMDB-WIKI dataset. The overall results for young adolescents aged 10 to 15 and older adolescents/adults aged 16 to 20 are very encouraging -- achieving MAEs as low as 1.79, but also suggest that the accuracy for children aged 0 to 10 needs further work. In order to determine the efficacy of the prototype, valuable input of four digital forensic experts, including two forensic investigators, has been taken into account to improve age estimation results. Further research is required to extend datasets both concerning image density and the equal distribution of factors such as gender and racial diversity.
    Structure-preserving Gaussian Process Dynamics. (arXiv:2102.01606v3 [cs.LG] UPDATED)
    (0 min) Most physical processes posses structural properties such as constant energies, volumes, and other invariants over time. When learning models of such dynamical systems, it is critical to respect these invariants to ensure accurate predictions and physically meaningful behavior. Strikingly, state-of-the-art methods in Gaussian process (GP) dynamics model learning are not addressing this issue. On the other hand, classical numerical integrators are specifically designed to preserve these crucial properties through time. We propose to combine the advantages of GPs as function approximators with structure preserving numerical integrators for dynamical systems, such as Runge-Kutta methods. These integrators assume access to the ground truth dynamics and require evaluations of intermediate and future time steps that are unknown in a learning-based scenario. This makes direct inference of the GP dynamics, with embedded numerical scheme, intractable. Our key technical contribution is the evaluation of the implicitly defined Runge-Kutta transition probability. In a nutshell, we introduce an implicit layer for GP regression, which is embedded into a variational inference-based model learning scheme.
    Enhanced Doubly Robust Learning for Debiasing Post-click Conversion Rate Estimation. (arXiv:2105.13623v3 [cs.LG] UPDATED)
    (0 min) Post-click conversion, as a strong signal indicating the user preference, is salutary for building recommender systems. However, accurately estimating the post-click conversion rate (CVR) is challenging due to the selection bias, i.e., the observed clicked events usually happen on users' preferred items. Currently, most existing methods utilize counterfactual learning to debias recommender systems. Among them, the doubly robust (DR) estimator has achieved competitive performance by combining the error imputation based (EIB) estimator and the inverse propensity score (IPS) estimator in a doubly robust way. However, inaccurate error imputation may result in its higher variance than the IPS estimator. Worse still, existing methods typically use simple model-agnostic methods to estimate the imputation error, which are not sufficient to approximate the dynamically changing model-correlated target (i.e., the gradient direction of the prediction model). To solve these problems, we first derive the bias and variance of the DR estimator. Based on it, a more robust doubly robust (MRDR) estimator has been proposed to further reduce its variance while retaining its double robustness. Moreover, we propose a novel double learning approach for the MRDR estimator, which can convert the error imputation into the general CVR estimation. Besides, we empirically verify that the proposed learning scheme can further eliminate the high variance problem of the imputation learning. To evaluate its effectiveness, extensive experiments are conducted on a semi-synthetic dataset and two real-world datasets. The results demonstrate the superiority of the proposed approach over the state-of-the-art methods. The code is available at https://github.com/guosyjlu/MRDR-DL.
    NeoNav: Improving the Generalization of Visual Navigation via Generating Next Expected Observations. (arXiv:1906.07207v4 [cs.RO] UPDATED)
    (0 min) We propose improving the cross-target and cross-scene generalization of visual navigation through learning an agent that is guided by conceiving the next observations it expects to see. This is achieved by learning a variational Bayesian model, called NeoNav, which generates the next expected observations (NEO) conditioned on the current observations of the agent and the target view. Our generative model is learned through optimizing a variational objective encompassing two key designs. First, the latent distribution is conditioned on current observations and the target view, leading to a model-based, target-driven navigation. Second, the latent space is modeled with a Mixture of Gaussians conditioned on the current observation and the next best action. Our use of mixture-of-posteriors prior effectively alleviates the issue of over-regularized latent space, thus significantly boosting the model generalization for new targets and in novel scenes. Moreover, the NEO generation models the forward dynamics of agent-environment interaction, which improves the quality of approximate inference and hence benefits data efficiency. We have conducted extensive evaluations on both real-world and synthetic benchmarks, and show that our model consistently outperforms the state-of-the-art models in terms of success rate, data efficiency, and generalization.
    Morphological Analysis of Japanese Hiragana Sentences using the BI-LSTM CRF Model. (arXiv:2201.03366v1 [cs.CL])
    (0 min) This study proposes a method to develop neural models of the morphological analyzer for Japanese Hiragana sentences using the Bi-LSTM CRF model. Morphological analysis is a technique that divides text data into words and assigns information such as parts of speech. This technique plays an essential role in downstream applications in Japanese natural language processing systems because the Japanese language does not have word delimiters between words. Hiragana is a type of Japanese phonogramic characters, which is used for texts for children or people who cannot read Chinese characters. Morphological analysis of Hiragana sentences is more difficult than that of ordinary Japanese sentences because there is less information for dividing. For morphological analysis of Hiragana sentences, we demonstrated the effectiveness of fine-tuning using a model based on ordinary Japanese text and examined the influence of training data on texts of various genres.
    MERLOT Reserve: Neural Script Knowledge through Vision and Language and Sound. (arXiv:2201.02639v1 [cs.CV])
    (0 min) As humans, we navigate the world through all our senses, using perceptual input from each one to correct the others. We introduce MERLOT Reserve, a model that represents videos jointly over time -- through a new training objective that learns from audio, subtitles, and video frames. Given a video, we replace snippets of text and audio with a MASK token; the model learns by choosing the correct masked-out snippet. Our objective learns faster than alternatives, and performs well at scale: we pretrain on 20 million YouTube videos. Empirical results show that MERLOT Reserve learns strong representations about videos through all constituent modalities. When finetuned, it sets a new state-of-the-art on both VCR and TVQA, outperforming prior work by 5% and 7% respectively. Ablations show that both tasks benefit from audio pretraining -- even VCR, a QA task centered around images (without sound). Moreover, our objective enables out-of-the-box prediction, revealing strong multimodal commonsense understanding. In a fully zero-shot setting, our model obtains competitive results on four video understanding tasks, even outperforming supervised approaches on the recently proposed Situated Reasoning (STAR) benchmark. We analyze why incorporating audio leads to better vision-language representations, suggesting significant opportunities for future research. We conclude by discussing ethical and societal implications of multimodal pretraining.
    Compressing Models with Few Samples: Mimicking then Replacing. (arXiv:2201.02620v1 [cs.LG])
    (0 min) Few-sample compression aims to compress a big redundant model into a small compact one with only few samples. If we fine-tune models with these limited few samples directly, models will be vulnerable to overfit and learn almost nothing. Hence, previous methods optimize the compressed model layer-by-layer and try to make every layer have the same outputs as the corresponding layer in the teacher model, which is cumbersome. In this paper, we propose a new framework named Mimicking then Replacing (MiR) for few-sample compression, which firstly urges the pruned model to output the same features as the teacher's in the penultimate layer, and then replaces teacher's layers before penultimate with a well-tuned compact one. Unlike previous layer-wise reconstruction methods, our MiR optimizes the entire network holistically, which is not only simple and effective, but also unsupervised and general. MiR outperforms previous methods with large margins. Codes will be available soon.
    Reducing the Long Tail Losses in Scientific Emulations with Active Learning. (arXiv:2111.08498v2 [cs.LG] UPDATED)
    (0 min) Deep-learning-based models are increasingly used to emulate scientific simulations to accelerate scientific research. However, accurate, supervised deep learning models require huge amount of labelled data, and that often becomes the bottleneck in employing neural networks. In this work, we leveraged an active learning approach called core-set selection to actively select data, per a pre-defined budget, to be labelled for training. To further improve the model performance and reduce the training costs, we also warm started the training using a shrink-and-perturb trick. We tested on two case studies in different fields, namely galaxy halo occupation distribution modelling in astrophysics and x-ray emission spectroscopy in plasma physics, and the results are promising: we achieved competitive overall performance compared to using a random sampling baseline, and more importantly, successfully reduced the larger absolute losses, i.e. the long tail in the loss distribution, at virtually no overhead costs.
    The State of Aerial Surveillance: A Survey. (arXiv:2201.03080v1 [cs.CV])
    (0 min) The rapid emergence of airborne platforms and imaging sensors are enabling new forms of aerial surveillance due to their unprecedented advantages in scale, mobility, deployment and covert observation capabilities. This paper provides a comprehensive overview of human-centric aerial surveillance tasks from a computer vision and pattern recognition perspective. It aims to provide readers with an in-depth systematic review and technical analysis of the current state of aerial surveillance tasks using drones, UAVs and other airborne platforms. The main object of interest is humans, where single or multiple subjects are to be detected, identified, tracked, re-identified and have their behavior analyzed. More specifically, for each of these four tasks, we first discuss unique challenges in performing these tasks in an aerial setting compared to a ground-based setting. We then review and analyze the aerial datasets publicly available for each task, and delve deep into the approaches in the aerial literature and investigate how they presently address the aerial challenges. We conclude the paper with discussion on the missing gaps and open research questions to inform future research avenues.
    A novel interpretable machine learning system to generate clinical risk scores: An application for predicting early mortality or unplanned readmission in a retrospective cohort study. (arXiv:2201.03291v1 [cs.LG])
    (0 min) Risk scores are widely used for clinical decision making and commonly generated from logistic regression models. Machine-learning-based methods may work well for identifying important predictors, but such 'black box' variable selection limits interpretability, and variable importance evaluated from a single model can be biased. We propose a robust and interpretable variable selection approach using the recently developed Shapley variable importance cloud (ShapleyVIC) that accounts for variability across models. Our approach evaluates and visualizes overall variable contributions for in-depth inference and transparent variable selection, and filters out non-significant contributors to simplify model building steps. We derive an ensemble variable ranking from variable contributions, which is easily integrated with an automated and modularized risk score generator, AutoScore, for convenient implementation. In a study of early death or unplanned readmission, ShapleyVIC selected 6 of 41 candidate variables to create a well-performing model, which had similar performance to a 16-variable model from machine-learning-based ranking.
    Teaching CNNs to mimic Human Visual Cognitive Process & regularise Texture-Shape bias. (arXiv:2006.14722v2 [cs.CV] UPDATED)
    (0 min) Recent experiments in computer vision demonstrate texture bias as the primary reason for supreme results in models employing Convolutional Neural Networks (CNNs), conflicting with early works claiming that these networks identify objects using shape. It is believed that the cost function forces the CNN to take a greedy approach and develop a proclivity for local information like texture to increase accuracy, thus failing to explore any global statistics. We propose CognitiveCNN, a new intuitive architecture, inspired from feature integration theory in psychology to utilise human interpretable feature like shape, texture, edges etc. to reconstruct, and classify the image. We define novel metrics to quantify the "relevance" of "abstract information" present in these modalities using attention maps. We further introduce a regularisation method which ensures that each modality like shape, texture etc. gets proportionate influence in a given task, as it does for reconstruction; and perform experiments to show the resulting boost in accuracy and robustness, besides imparting explainability to these CNNs for achieving superior performance in object recognition.
    Auto-Encoder based Co-Training Multi-View Representation Learning. (arXiv:2201.02978v1 [cs.LG])
    (0 min) Multi-view learning is a learning problem that utilizes the various representations of an object to mine valuable knowledge and improve the performance of learning algorithm, and one of the significant directions of multi-view learning is sub-space learning. As we known, auto-encoder is a method of deep learning, which can learn the latent feature of raw data by reconstructing the input, and based on this, we propose a novel algorithm called Auto-encoder based Co-training Multi-View Learning (ACMVL), which utilizes both complementarity and consistency and finds a joint latent feature representation of multiple views. The algorithm has two stages, the first is to train auto-encoder of each view, and the second stage is to train a supervised network. Interestingly, the two stages share the weights partly and assist each other by co-training process. According to the experimental result, we can learn a well performed latent feature representation, and auto-encoder of each view has more powerful reconstruction ability than traditional auto-encoder.
    Self-supervised Spatiotemporal Representation Learning by Exploiting Video Continuity. (arXiv:2112.05883v2 [cs.CV] UPDATED)
    (0 min) Recent self-supervised video representation learning methods have found significant success by exploring essential properties of videos, e.g. speed, temporal order, etc. This work exploits an essential yet under-explored property of videos, the video continuity, to obtain supervision signals for self-supervised representation learning. Specifically, we formulate three novel continuity-related pretext tasks, i.e. continuity justification, discontinuity localization, and missing section approximation, that jointly supervise a shared backbone for video representation learning. This self-supervision approach, termed as Continuity Perception Network (CPNet), solves the three tasks altogether and encourages the backbone network to learn local and long-ranged motion and context representations. It outperforms prior arts on multiple downstream tasks, such as action recognition, video retrieval, and action localization. Additionally, the video continuity can be complementary to other coarse-grained video properties for representation learning, and integrating the proposed pretext task to prior arts can yield much performance gains.
    Fully automatic scoring of handwritten descriptive answers in Japanese language tests. (arXiv:2201.03215v1 [cs.LG])
    (0 min) This paper presents an experiment of automatically scoring handwritten descriptive answers in the trial tests for the new Japanese university entrance examination, which were made for about 120,000 examinees in 2017 and 2018. There are about 400,000 answers with more than 20 million characters. Although all answers have been scored by human examiners, handwritten characters are not labelled. We present our attempt to adapt deep neural network-based handwriting recognizers trained on a labelled handwriting dataset into this unlabeled answer set. Our proposed method combines different training strategies, ensembles multiple recognizers, and uses a language model built from a large general corpus to avoid overfitting into specific data. In our experiment, the proposed method records character accuracy of over 97% using about 2,000 verified labelled answers that account for less than 0.5% of the dataset. Then, the recognized answers are fed into a pre-trained automatic scoring system based on the BERT model without correcting misrecognized characters and providing rubric annotations. The automatic scoring system achieves from 0.84 to 0.98 of Quadratic Weighted Kappa (QWK). As QWK is over 0.8, it represents acceptable similarity of scoring between the automatic scoring system and the human examiners. These results are promising for further research on end-to-end automatic scoring of descriptive answers.
    Modeling and Analysis of Intermittent Federated Learning Over Cellular-Connected UAV Networks. (arXiv:2110.07077v2 [cs.LG] UPDATED)
    (0 min) Federated learning (FL) is a promising distributed learning technique particularly suitable for wireless learning scenarios since it can accomplish a learning task without raw data transportation so as to preserve data privacy and lower network resource consumption. However, current works on FL over wireless networks do not profoundly study the fundamental performance of FL over wireless networks that suffers from communication outage due to channel impairment and network interference. To accurately exploit the performance of FL over wireless networks, this paper proposes a novel intermittent FL model over a cellular-connected unmanned aerial vehicle (UAV) network, which characterizes communication outage from UAV (clients) to their server and data heterogeneity among the datasets at UAVs. We propose an analytically tractable framework to derive the uplink outage probability and use it to devise a simulation-based approach so as to evaluate the performance of the proposed intermittent FL model. Our findings reveal how the intermittent FL model is impacted by uplink communication outage and UAV deployment. Extensive numerical simulations are provided to show the consistency between the simulated and analytical performances of the proposed intermittent FL model.
    Meta-Generalization for Multiparty Privacy Learning to Identify Anomaly Multimedia Traffic in Graynet. (arXiv:2201.03027v1 [cs.CR])
    (0 min) Identifying anomaly multimedia traffic in cyberspace is a big challenge in distributed service systems, multiple generation networks and future internet of everything. This letter explores meta-generalization for a multiparty privacy learning model in graynet to improve the performance of anomaly multimedia traffic identification. The multiparty privacy learning model in graynet is a globally shared model that is partitioned, distributed and trained by exchanging multiparty parameters updates with preserving private data. The meta-generalization refers to discovering the inherent attributes of a learning model to reduce its generalization error. In experiments, three meta-generalization principles are tested as follows. The generalization error of the multiparty privacy learning model in graynet is reduced by changing the dimension of byte-level imbedding. Following that, the error is reduced by adapting the depth for extracting packet-level features. Finally, the error is reduced by adjusting the size of support set for preprocessing traffic-level data. Experimental results demonstrate that the proposal outperforms the state-of-the-art learning models for identifying anomaly multimedia traffic.
    An Interpretable Federated Learning-based Network Intrusion Detection Framework. (arXiv:2201.03134v1 [cs.CR])
    (0 min) Learning-based Network Intrusion Detection Systems (NIDSs) are widely deployed for defending various cyberattacks. Existing learning-based NIDS mainly uses Neural Network (NN) as a classifier that relies on the quality and quantity of cyberattack data. Such NN-based approaches are also hard to interpret for improving efficiency and scalability. In this paper, we design a new local-global computation paradigm, FEDFOREST, a novel learning-based NIDS by combining the interpretable Gradient Boosting Decision Tree (GBDT) and Federated Learning (FL) framework. Specifically, FEDFOREST is composed of multiple clients that extract local cyberattack data features for the server to train models and detect intrusions. A privacy-enhanced technology is also proposed in FEDFOREST to further defeat the privacy of the FL systems. Extensive experiments on 4 cyberattack datasets of different tasks demonstrate that FEDFOREST is effective, efficient, interpretable, and extendable. FEDFOREST ranks first in the collaborative learning and cybersecurity competition 2021 for Chinese college students.
    Uncovering the Source of Machine Bias. (arXiv:2201.03092v1 [cs.LG])
    (0 min) We develop a structural econometric model to capture the decision dynamics of human evaluators on an online micro-lending platform, and estimate the model parameters using a real-world dataset. We find two types of biases in gender, preference-based bias and belief-based bias, are present in human evaluators' decisions. Both types of biases are in favor of female applicants. Through counterfactual simulations, we quantify the effect of gender bias on loan granting outcomes and the welfare of the company and the borrowers. Our results imply that both the existence of the preference-based bias and that of the belief-based bias reduce the company's profits. When the preference-based bias is removed, the company earns more profits. When the belief-based bias is removed, the company's profits also increase. Both increases result from raising the approval probability for borrowers, especially male borrowers, who eventually pay back loans. For borrowers, the elimination of either bias decreases the gender gap of the true positive rates in the credit risk evaluation. We also train machine learning algorithms on both the real-world data and the data from the counterfactual simulations. We compare the decisions made by those algorithms to see how evaluators' biases are inherited by the algorithms and reflected in machine-based decisions. We find that machine learning algorithms can mitigate both the preference-based bias and the belief-based bias.
    Communication-Efficient Federated Learning with Acceleration of Global Momentum. (arXiv:2201.03172v1 [cs.LG])
    (0 min) Federated learning often suffers from unstable and slow convergence due to heterogeneous characteristics of participating clients. Such tendency is aggravated when the client participation ratio is low since the information collected from the clients at each round is prone to be more inconsistent. To tackle the challenge, we propose a novel federated learning framework, which improves the stability of the server-side aggregation step, which is achieved by sending the clients an accelerated model estimated with the global gradient to guide the local gradient updates. Our algorithm naturally aggregates and conveys the global update information to participants with no additional communication cost and does not require to store the past models in the clients. We also regularize local update to further reduce the bias and improve the stability of local updates. We perform comprehensive empirical studies on real data under various settings and demonstrate the remarkable performance of the proposed method in terms of accuracy and communication-efficiency compared to the state-of-the-art methods, especially with low client participation rates. Our code is available at https://github.com/ ninigapa0/FedAGM
    Differentially Private $\ell_1$-norm Linear Regression with Heavy-tailed Data. (arXiv:2201.03204v1 [cs.LG])
    (0 min) We study the problem of Differentially Private Stochastic Convex Optimization (DP-SCO) with heavy-tailed data. Specifically, we focus on the $\ell_1$-norm linear regression in the $\epsilon$-DP model. While most of the previous work focuses on the case where the loss function is Lipschitz, here we only need to assume the variates has bounded moments. Firstly, we study the case where the $\ell_2$ norm of data has bounded second order moment. We propose an algorithm which is based on the exponential mechanism and show that it is possible to achieve an upper bound of $\tilde{O}(\sqrt{\frac{d}{n\epsilon}})$ (with high probability). Next, we relax the assumption to bounded $\theta$-th order moment with some $\theta\in (1, 2)$ and show that it is possible to achieve an upper bound of $\tilde{O}(({\frac{d}{n\epsilon}})^\frac{\theta-1}{\theta})$. Our algorithms can also be extended to more relaxed cases where only each coordinate of the data has bounded moments, and we can get an upper bound of $\tilde{O}({\frac{d}{\sqrt{n\epsilon}}})$ and $\tilde{O}({\frac{d}{({n\epsilon})^\frac{\theta-1}{\theta}}})$ in the second and $\theta$-th moment case respectively.
    TPAD: Identifying Effective Trajectory Predictions Under the Guidance of Trajectory Anomaly Detection Model. (arXiv:2201.02941v1 [cs.LG])
    (0 min) Trajectory Prediction (TP) is an important research topic in computer vision and robotics fields. Recently, many stochastic TP models have been proposed to deal with this problem and have achieved better performance than the traditional models with deterministic trajectory outputs. However, these stochastic models can generate a number of future trajectories with different qualities. They are lack of self-evaluation ability, that is, to examine the rationality of their prediction results, thus failing to guide users to identify high-quality ones from their candidate results. This hinders them from playing their best in real applications. In this paper, we make up for this defect and propose TPAD, a novel TP evaluation method based on the trajectory Anomaly Detection (AD) technique. In TPAD, we firstly combine the Automated Machine Learning (AutoML) technique and the experience in the AD and TP field to automatically design an effective trajectory AD model. Then, we utilize the learned trajectory AD model to examine the rationality of the predicted trajectories, and screen out good TP results for users. Extensive experimental results demonstrate that TPAD can effectively identify near-optimal prediction results, improving stochastic TP models' practical application effect.
    Comparing Sequential Forecasters. (arXiv:2110.00115v3 [stat.ME] UPDATED)
    (0 min) Consider two or more forecasters, each making a sequence of predictions for different events over time. We ask a relatively basic question: how might we compare these forecasters, either online or post-hoc, while avoiding unverifiable assumptions on how the forecasts or outcomes were generated? This work presents a novel and rigorous answer to this question. We design a sequential inference procedure for estimating the time-varying difference in forecast quality as measured by any scoring rule. The resulting confidence intervals are nonasymptotically valid and can be continuously monitored to yield statistically valid comparisons at arbitrary data-dependent stopping times ("anytime-valid"); this is enabled by adapting variance-adaptive supermartingales, confidence sequences, and e-processes to our setting. Motivated by Shafer and Vovk's game-theoretic probability, our coverage guarantees are also distribution-free, in the sense that they make no distributional assumptions on the forecasts or outcomes. In contrast to a recent work by Henzi and Ziegel, our tools can sequentially test a weak null hypothesis about whether one forecaster outperforms another on average over time. We demonstrate their effectiveness by comparing probability forecasts on Major League Baseball (MLB) games and statistical postprocessing methods for ensemble weather forecasts.
    Masked Gradient-Based Causal Structure Learning. (arXiv:1910.08527v3 [cs.LG] UPDATED)
    (0 min) This paper studies the problem of learning causal structures from observational data. We reformulate the Structural Equation Model (SEM) with additive noises in a form parameterized by binary graph adjacency matrix and show that, if the original SEM is identifiable, then the binary adjacency matrix can be identified up to super-graphs of the true causal graph under mild conditions. We then utilize the reformulated SEM to develop a causal structure learning method that can be efficiently trained using gradient-based optimization, by leveraging a smooth characterization on acyclicity and the Gumbel-Softmax approach to approximate the binary adjacency matrix. It is found that the obtained entries are typically near zero or one and can be easily thresholded to identify the edges. We conduct experiments on synthetic and real datasets to validate the effectiveness of the proposed method, and show that it readily includes different smooth model functions and achieves a much improved performance on most datasets considered.
    Stability Based Generalization Bounds for Exponential Family Langevin Dynamics. (arXiv:2201.03064v1 [cs.LG])
    (2 min) We study generalization bounds for noisy stochastic mini-batch iterative algorithms based on the notion of stability. Recent years have seen key advances in data-dependent generalization bounds for noisy iterative learning algorithms such as stochastic gradient Langevin dynamics (SGLD) based on stability (Mou et al., 2018; Li et al., 2020) and information theoretic approaches (Xu and Raginsky, 2017; Negrea et al., 2019; Steinke and Zakynthinou, 2020; Haghifam et al., 2020). In this paper, we unify and substantially generalize stability based generalization bounds and make three technical advances. First, we bound the generalization error of general noisy stochastic iterative algorithms (not necessarily gradient descent) in terms of expected (not uniform) stability. The expected stability can in turn be bounded by a Le Cam Style Divergence. Such bounds have a O(1/n) sample dependence unlike many existing bounds with O(1/\sqrt{n}) dependence. Second, we introduce Exponential Family Langevin Dynamics(EFLD) which is a substantial generalization of SGLD and which allows exponential family noise to be used with stochastic gradient descent (SGD). We establish data-dependent expected stability based generalization bounds for general EFLD algorithms. Third, we consider an important special case of EFLD: noisy sign-SGD, which extends sign-SGD using Bernoulli noise over {-1,+1}. Generalization bounds for noisy sign-SGD are implied by that of EFLD and we also establish optimization guarantees for the algorithm. Further, we present empirical results on benchmark datasets to illustrate that our bounds are non-vacuous and quantitatively much sharper than existing bounds.
    Fast solver for J2-perturbed Lambert problem using deep neural network. (arXiv:2201.02942v1 [math.NA])
    (2 min) This paper presents a novel and fast solver for the J2-perturbed Lambert problem. The solver consists of an intelligent initial guess generator combined with a differential correction procedure. The intelligent initial guess generator is a deep neural network that is trained to correct the initial velocity vector coming from the solution of the unperturbed Lambert problem. The differential correction module takes the initial guess and uses a forward shooting procedure to further update the initial velocity and exactly meet the terminal conditions. Eight sample forms are analyzed and compared to find the optimum form to train the neural network on the J2-perturbed Lambert problem. The accuracy and performance of this novel approach will be demonstrated on a representative test case: the solution of a multi-revolution J2-perturbed Lambert problem in the Jupiter system. We will compare the performance of the proposed approach against a classical standard shooting method and a homotopy-based perturbed Lambert algorithm. It will be shown that, for a comparable level of accuracy, the proposed method is significantly faster than the other two.
    Discriminant Analysis in Contrasting Dimensions for Polycystic Ovary Syndrome Prognostication. (arXiv:2201.03029v1 [cs.LG])
    (2 min) A lot of prognostication methodologies have been formulated for early detection of Polycystic Ovary Syndrome also known as PCOS using Machine Learning. PCOS is a binary classification problem. Dimensionality Reduction methods impact the performance of Machine Learning to a greater extent and using a Supervised Dimensionality Reduction method can give us a new edge to tackle this problem. In this paper we present Discriminant Analysis in different dimensions with Linear and Quadratic form for binary classification along with metrics. We were able to achieve good accuracy and less variation with Discriminant Analysis as compared to many commonly used classification algorithms with training accuracy reaching 97.37% and testing accuracy of 95.92% using Quadratic Discriminant Analysis. Paper also gives the analysis of data with visualizations for deeper understanding of problem.
    Hierarchical Graph-Convolutional Variational AutoEncoding for Generative Modelling of Human Motion. (arXiv:2111.12602v3 [cs.CV] UPDATED)
    (2 min) Models of human motion commonly focus either on trajectory prediction or action classification but rarely both. The marked heterogeneity and intricate compositionality of human motion render each task vulnerable to the data degradation and distributional shift common to real-world scenarios. A sufficiently expressive generative model of action could in theory enable data conditioning and distributional resilience within a unified framework applicable to both tasks. Here we propose a novel architecture based on hierarchical variational autoencoders and deep graph convolutional neural networks for generating a holistic model of action over multiple time-scales. We show this Hierarchical Graph-convolutional Variational Autoencoder (HG-VAE) to be capable of generating coherent actions, detecting out-of-distribution data, and imputing missing data by gradient ascent on the model's posterior. Trained and evaluated on H3.6M and the largest collection of open source human motion data, AMASS, we show HG-VAE can facilitate downstream discriminative learning better than baseline models.
    Systematic biases when using deep neural networks for annotating large catalogs of astronomical images. (arXiv:2201.03131v1 [astro-ph.GA])
    (2 min) Deep convolutional neural networks (DCNNs) have become the most common solution for automatic image annotation due to their non-parametric nature, good performance, and their accessibility through libraries such as TensorFlow. Among other fields, DCNNs are also a common approach to the annotation of large astronomical image databases acquired by digital sky surveys. One of the main downsides of DCNNs is the complex non-intuitive rules that make DCNNs act as a ``black box", providing annotations in a manner that is unclear to the user. Therefore, the user is often not able to know what information is used by the DCNNs for the classification. Here we demonstrate that the training of a DCNN is sensitive to the context of the training data such as the location of the objects in the sky. We show that for basic classification of elliptical and spiral galaxies, the sky location of the galaxies used for training affects the behavior of the algorithm, and leads to a small but consistent and statistically significant bias. That bias exhibits itself in the form of cosmological-scale anisotropy in the distribution of basic galaxy morphology. Therefore, while DCNNs are powerful tools for annotating images of extended sources, the construction of training sets for galaxy morphology should take into consideration more aspects than the visual appearance of the object. In any case, catalogs created with deep neural networks that exhibit signs of cosmological anisotropy should be interpreted with the possibility of consistent bias.
    COVID-19 Infection Segmentation from Chest CT Images Based on Scale Uncertainty. (arXiv:2201.03053v1 [eess.IV])
    (3 min) This paper proposes a segmentation method of infection regions in the lung from CT volumes of COVID-19 patients. COVID-19 spread worldwide, causing many infected patients and deaths. CT image-based diagnosis of COVID-19 can provide quick and accurate diagnosis results. An automated segmentation method of infection regions in the lung provides a quantitative criterion for diagnosis. Previous methods employ whole 2D image or 3D volume-based processes. Infection regions have a considerable variation in their sizes. Such processes easily miss small infection regions. Patch-based process is effective for segmenting small targets. However, selecting the appropriate patch size is difficult in infection region segmentation. We utilize the scale uncertainty among various receptive field sizes of a segmentation FCN to obtain infection regions. The receptive field sizes can be defined as the patch size and the resolution of volumes where patches are clipped from. This paper proposes an infection segmentation network (ISNet) that performs patch-based segmentation and a scale uncertainty-aware prediction aggregation method that refines the segmentation result. We design ISNet to segment infection regions that have various intensity values. ISNet has multiple encoding paths to process patch volumes normalized by multiple intensity ranges. We collect prediction results generated by ISNets having various receptive field sizes. Scale uncertainty among the prediction results is extracted by the prediction aggregation method. We use an aggregation FCN to generate a refined segmentation result considering scale uncertainty among the predictions. In our experiments using 199 chest CT volumes of COVID-19 cases, the prediction aggregation method improved the dice similarity score from 47.6% to 62.1%.
    Differentially Private Generative Adversarial Networks with Model Inversion. (arXiv:2201.03139v1 [cs.LG])
    (2 min) To protect sensitive data in training a Generative Adversarial Network (GAN), the standard approach is to use differentially private (DP) stochastic gradient descent method in which controlled noise is added to the gradients. The quality of the output synthetic samples can be adversely affected and the training of the network may not even converge in the presence of these noises. We propose Differentially Private Model Inversion (DPMI) method where the private data is first mapped to the latent space via a public generator, followed by a lower-dimensional DP-GAN with better convergent properties. Experimental results on standard datasets CIFAR10 and SVHN as well as on a facial landmark dataset for Autism screening show that our approach outperforms the standard DP-GAN method based on Inception Score, Fr\'echet Inception Distance, and classification accuracy under the same privacy guarantee.
    Detecting CAN Masquerade Attacks with Signal Clustering Similarity. (arXiv:2201.02665v1 [cs.CR])
    (2 min) Vehicular Controller Area Networks (CANs) are susceptible to cyber attacks of different levels of sophistication. Fabrication attacks are the easiest to administer -- an adversary simply sends (extra) frames on a CAN -- but also the easiest to detect because they disrupt frame frequency. To overcome time-based detection methods, adversaries must administer masquerade attacks by sending frames in lieu of (and therefore at the expected time of) benign frames but with malicious payloads. Research efforts have proven that CAN attacks, and masquerade attacks in particular, can affect vehicle functionality. Examples include causing unintended acceleration, deactivation of vehicle's brakes, as well as steering the vehicle. We hypothesize that masquerade attacks modify the nuanced correlations of CAN signal time series and how they cluster together. Therefore, changes in cluster assignments should indicate anomalous behavior. We confirm this hypothesis by leveraging our previously developed capability for reverse engineering CAN signals (i.e., CAN-D [Controller Area Network Decoder]) and focus on advancing the state of the art for detecting masquerade attacks by analyzing time series extracted from raw CAN frames. Specifically, we demonstrate that masquerade attacks can be detected by computing time series clustering similarity using hierarchical clustering on the vehicle's CAN signals (time series) and comparing the clustering similarity across CAN captures with and without attacks. We test our approach in a previously collected CAN dataset with masquerade attacks (i.e., the ROAD dataset) and develop a forensic tool as a proof of concept to demonstrate the potential of the proposed approach for detecting CAN masquerade attacks.
    Embracing New Techniques in Deep Learning for Estimating Image Memorability. (arXiv:2105.10598v3 [cs.CV] UPDATED)
    (2 min) Various work has suggested that the memorability of an image is consistent across people, and thus can be treated as an intrinsic property of an image. Using computer vision models, we can make specific predictions about what people will remember or forget. While older work has used now-outdated deep learning architectures to predict image memorability, innovations in the field have given us new techniques to apply to this problem. Here, we propose and evaluate five alternative deep learning models which exploit developments in the field from the last five years, largely the introduction of residual neural networks, which are intended to allow the model to use semantic information in the memorability estimation process. These new models were tested against the prior state of the art with a combined dataset built to optimize both within-category and across-category predictions. Our findings suggest that the key prior memorability network had overstated its generalizability and was overfit on its training set. Our new models outperform this prior model, leading us to conclude that Residual Networks outperform simpler convolutional neural networks in memorability regression. We make our new state-of-the-art model readily available to the research community, allowing memory researchers to make predictions about memorability on a wider range of images.
    Robust classification with flexible discriminant analysis in heterogeneous data. (arXiv:2201.02967v1 [stat.ML])
    (2 min) Linear and Quadratic Discriminant Analysis are well-known classical methods but can heavily suffer from non-Gaussian distributions and/or contaminated datasets, mainly because of the underlying Gaussian assumption that is not robust. To fill this gap, this paper presents a new robust discriminant analysis where each data point is drawn by its own arbitrary Elliptically Symmetrical (ES) distribution and its own arbitrary scale parameter. Such a model allows for possibly very heterogeneous, independent but non-identically distributed samples. After deriving a new decision rule, it is shown that maximum-likelihood parameter estimation and classification are very simple, fast and robust compared to state-of-the-art methods.
    Least-Squares ReLU Neural Network (LSNN) Method For Scalar Nonlinear Hyperbolic Conservation Law. (arXiv:2105.11627v3 [math.NA] UPDATED)
    (2 min) We introduced the least-squares ReLU neural network (LSNN) method for solving the linear advection-reaction problem with discontinuous solution and showed that the method outperforms mesh-based numerical methods in terms of the number of degrees of freedom. This paper studies the LSNN method for scalar nonlinear hyperbolic conservation law. The method is a discretization of an equivalent least-squares (LS) formulation in the set of neural network functions with the ReLU activation function. Evaluation of the LS functional is done by using numerical integration and conservative finite volume scheme. Numerical results of some test problems show that the method is capable of approximating the discontinuous interface of the underlying problem automatically through the free breaking lines of the ReLU neural network. Moreover, the method does not exhibit the common Gibbs phenomena along the discontinuous interface.
    Towards the Next 1000 Languages in Multilingual Machine Translation: Exploring the Synergy Between Supervised and Self-Supervised Learning. (arXiv:2201.03110v1 [cs.CL])
    (2 min) Achieving universal translation between all human language pairs is the holy-grail of machine translation (MT) research. While recent progress in massively multilingual MT is one step closer to reaching this goal, it is becoming evident that extending a multilingual MT system simply by training on more parallel data is unscalable, since the availability of labeled data for low-resource and non-English-centric language pairs is forbiddingly limited. To this end, we present a pragmatic approach towards building a multilingual MT model that covers hundreds of languages, using a mixture of supervised and self-supervised objectives, depending on the data availability for different language pairs. We demonstrate that the synergy between these two training paradigms enables the model to produce high-quality translations in the zero-resource setting, even surpassing supervised translation quality for low- and mid-resource languages. We conduct a wide array of experiments to understand the effect of the degree of multilingual supervision, domain mismatches and amounts of parallel and monolingual data on the quality of our self-supervised multilingual models. To demonstrate the scalability of the approach, we train models with over 200 languages and demonstrate high performance on zero-resource translation on several previously under-studied languages. We hope our findings will serve as a stepping stone towards enabling translation for the next thousand languages.
    Medication Error Detection Using Contextual Language Models. (arXiv:2201.03035v1 [cs.CL])
    (2 min) Medication errors most commonly occur at the ordering or prescribing stage, potentially leading to medical complications and poor health outcomes. While it is possible to catch these errors using different techniques; the focus of this work is on textual and contextual analysis of prescription information to detect and prevent potential medication errors. In this paper, we demonstrate how to use BERT-based contextual language models to detect anomalies in written or spoken text based on a data set extracted from real-world medical data of thousands of patient records. The proposed models are able to learn patterns of text dependency and predict erroneous output based on contextual information such as patient data. The experimental results yield accuracy up to 96.63% for text input and up to 79.55% for speech input, which is satisfactory for most real-world applications.
    Learning with less labels in Digital Pathology via Scribble Supervision from natural images. (arXiv:2201.02627v1 [eess.IV])
    (2 min) A critical challenge of training deep learning models in the Digital Pathology (DP) domain is the high annotation cost by medical experts. One way to tackle this issue is via transfer learning from the natural image domain (NI), where the annotation cost is considerably cheaper. Cross-domain transfer learning from NI to DP is shown to be successful via class labels~\cite{teh2020learning}. One potential weakness of relying on class labels is the lack of spatial information, which can be obtained from spatial labels such as full pixel-wise segmentation labels and scribble labels. We demonstrate that scribble labels from NI domain can boost the performance of DP models on two cancer classification datasets (Patch Camelyon Breast Cancer and Colorectal Cancer dataset). Furthermore, we show that models trained with scribble labels yield the same performance boost as full pixel-wise segmentation labels despite being significantly easier and faster to collect.
    Traversing the Local Polytopes of ReLU Neural Networks: A Unified Approach for Network Verification. (arXiv:2111.08922v2 [cs.LG] UPDATED)
    (2 min) Although neural networks (NNs) with ReLU activation functions have found success in a wide range of applications, their adoption in risk-sensitive settings has been limited by the concerns on robustness and interpretability. Previous works to examine robustness and to improve interpretability partially exploited the piecewise linear function form of ReLU NNs. In this paper, we explore the unique topological structure that ReLU NNs create in the input space, identifying the adjacency among the partitioned local polytopes and developing a traversing algorithm based on this adjacency. Our polytope traversing algorithm can be adapted to verify a wide range of network properties related to robustness and interpretability, providing an unified approach to examine the network behavior. As the traversing algorithm explicitly visits all local polytopes, it returns a clear and full picture of the network behavior within the traversed region. The time and space complexity of the traversing algorithm is determined by the number of a ReLU NN's partitioning hyperplanes passing through the traversing region.
    On the Double Descent of Random Features Models Trained with SGD. (arXiv:2110.06910v4 [stat.ML] UPDATED)
    (2 min) This paper studies generalization properties of random features (RF) regression in high dimensions optimized by stochastic gradient descent (SGD). In this regime, we derive precise non-asymptotic error bounds of RF regression under both constant and adaptive step-size SGD setting, and observe the double descent phenomenon both theoretically and empirically. Our analysis shows how to cope with multiple randomness sources of initialization, label noise, and data sampling (as well as stochastic gradients) with no closed-form solution, and also goes beyond the commonly-used Gaussian/spherical data assumption. Our theoretical results demonstrate that, with SGD training, RF regression still generalizes well for interpolation learning, and is able to characterize the double descent behavior by the unimodality of variance and monotonic decrease of bias. Besides, we also prove that the constant step-size SGD setting incurs no loss in convergence rate when compared to the exact minimal-norm interpolator, as a theoretical justification of using SGD in practice.
    Rethink Stealthy Backdoor Attacks in Natural Language Processing. (arXiv:2201.02993v1 [cs.CL])
    (2 min) Recently, it has been shown that natural language processing (NLP) models are vulnerable to a kind of security threat called the Backdoor Attack, which utilizes a `backdoor trigger' paradigm to mislead the models. The most threatening backdoor attack is the stealthy backdoor, which defines the triggers as text style or syntactic. Although they have achieved an incredible high attack success rate (ASR), we find that the principal factor contributing to their ASR is not the `backdoor trigger' paradigm. Thus the capacity of these stealthy backdoor attacks is overestimated when categorized as backdoor attacks. Therefore, to evaluate the real attack power of backdoor attacks, we propose a new metric called attack successful rate difference (ASRD), which measures the ASR difference between clean state and poison state models. Besides, since the defenses against stealthy backdoor attacks are absent, we propose Trigger Breaker, consisting of two too simple tricks that can defend against stealthy backdoor attacks effectively. Experiments on text classification tasks show that our method achieves significantly better performance than state-of-the-art defense methods against stealthy backdoor attacks.
    CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows. (arXiv:2107.00652v3 [cs.CV] UPDATED)
    (2 min) We present CSWin Transformer, an efficient and effective Transformer-based backbone for general-purpose vision tasks. A challenging issue in Transformer design is that global self-attention is very expensive to compute whereas local self-attention often limits the field of interactions of each token. To address this issue, we develop the Cross-Shaped Window self-attention mechanism for computing self-attention in the horizontal and vertical stripes in parallel that form a cross-shaped window, with each stripe obtained by splitting the input feature into stripes of equal width. We provide a mathematical analysis of the effect of the stripe width and vary the stripe width for different layers of the Transformer network which achieves strong modeling capability while limiting the computation cost. We also introduce Locally-enhanced Positional Encoding (LePE), which handles the local positional information better than existing encoding schemes. LePE naturally supports arbitrary input resolutions, and is thus especially effective and friendly for downstream tasks. Incorporated with these designs and a hierarchical structure, CSWin Transformer demonstrates competitive performance on common vision tasks. Specifically, it achieves 85.4\% Top-1 accuracy on ImageNet-1K without any extra training data or label, 53.9 box AP and 46.4 mask AP on the COCO detection task, and 52.2 mIOU on the ADE20K semantic segmentation task, surpassing previous state-of-the-art Swin Transformer backbone by +1.2, +2.0, +1.4, and +2.0 respectively under the similar FLOPs setting. By further pretraining on the larger dataset ImageNet-21K, we achieve 87.5% Top-1 accuracy on ImageNet-1K and high segmentation performance on ADE20K with 55.7 mIoU. The code and models are available at https://github.com/microsoft/CSWin-Transformer.
    Cluster Regularization via a Hierarchical Feature Regression. (arXiv:2107.04831v2 [stat.ML] UPDATED)
    (2 min) This paper proposes a novel graph-based regularized regression estimator - the hierarchical feature regression (HFR) -, which mobilizes insights from the domains of machine learning and graph theory to estimate robust parameters for a linear regression. The estimator constructs a supervised feature graph that decomposes parameters along its edges, adjusting first for common variation and successively incorporating idiosyncratic patterns into the fitting process. The graph structure has the effect of shrinking parameters towards group targets, where the extent of shrinkage is governed by a hyperparamter, and group compositions as well as shrinkage targets are determined endogenously. The method offers rich resources for the visual exploration of the latent effect structure in the data, and demonstrates good predictive accuracy and versatility when compared to a panel of commonly used regularization techniques across a range of empirical and simulated regression tasks.
    An Adaptive Neuro-Fuzzy System with Integrated Feature Selection and Rule Extraction for High-Dimensional Classification Problems. (arXiv:2201.03187v1 [cs.LG])
    (2 min) A major limitation of fuzzy or neuro-fuzzy systems is their failure to deal with high-dimensional datasets. This happens primarily due to the use of T-norm, particularly, product or minimum (or a softer version of it). Thus, there are hardly any work dealing with datasets with dimensions more than hundred or so. Here, we propose a neuro-fuzzy framework that can handle datasets with dimensions even more than 7000! In this context, we propose an adaptive softmin (Ada-softmin) which effectively overcomes the drawbacks of ``numeric underflow" and ``fake minimum" that arise for existing fuzzy systems while dealing with high-dimensional problems. We call it an Adaptive Takagi-Sugeno-Kang (AdaTSK) fuzzy system. We then equip the AdaTSK system to perform feature selection and rule extraction in an integrated manner. In this context, a novel gate function is introduced and embedded only in the consequent parts, which can determine the useful features and rules, in two successive phases of learning. Unlike conventional fuzzy rule bases, we design an enhanced fuzzy rule base (En-FRB), which maintains adequate rules but does not grow the number of rules exponentially with dimension that typically happens for fuzzy neural networks. The integrated Feature Selection and Rule Extraction AdaTSK (FSRE-AdaTSK) system consists of three sequential phases: (i) feature selection, (ii) rule extraction, and (iii) fine tuning. The effectiveness of the FSRE-AdaTSK is demonstrated on 19 datasets of which five are in more than 2000 dimension including two with dimension greater than 7000. This may be the first time fuzzy systems are realized for classification involving more than 7000 input features.
    Conditional Approximate Normalizing Flows for Joint Multi-Step Probabilistic Electricity Demand Forecasting. (arXiv:2201.02753v1 [cs.LG])
    (2 min) Some real-world decision-making problems require making probabilistic forecasts over multiple steps at once. However, methods for probabilistic forecasting may fail to capture correlations in the underlying time-series that exist over long time horizons as errors accumulate. One such application is with resource scheduling under uncertainty in a grid environment, which requires forecasting electricity demand that is inherently noisy, but often cyclic. In this paper, we introduce the conditional approximate normalizing flow (CANF) to make probabilistic multi-step time-series forecasts when correlations are present over long time horizons. We first demonstrate our method's efficacy on estimating the density of a toy distribution, finding that CANF improves the KL divergence by one-third compared to that of a Gaussian mixture model while still being amenable to explicit conditioning. We then use a publicly available household electricity consumption dataset to showcase the effectiveness of CANF on joint probabilistic multi-step forecasting. Empirical results show that conditional approximate normalizing flows outperform other methods in terms of multi-step forecast accuracy and lead to up to 10x better scheduling decisions. Our implementation is available at https://github.com/sisl/JointDemandForecasting.
    FedDTG:Federated Data-Free Knowledge Distillation via Three-Player Generative Adversarial Networks. (arXiv:2201.03169v1 [cs.LG])
    (2 min) Applying knowledge distillation to personalized cross-silo federated learning can well alleviate the problem of user heterogeneity. This approach, however, requires a proxy dataset, which is difficult to obtain in the real world. Moreover, the global model based on parameter averaging will lead to the leakage of user privacy. We introduce a distributed three-player GAN to implement datafree co-distillation between clients. This technique mitigates the user heterogeneity problem and better protects user privacy. We confirmed that thefake samples generated by GAN can make federated distillation more efficient and robust, and the co-distillation can achieve good performance for individual clients on the basis of obtaining global knowledge. Our extensive experiments on benchmark datasets demonstrate the superior generalization performance of the proposed methods, compared with the state-of-the-art.
    RARA: Zero-shot Sim2Real Visual Navigation with Following Foreground Cues. (arXiv:2201.02798v1 [cs.CV])
    (2 min) The gap between simulation and the real-world restrains many machine learning breakthroughs in computer vision and reinforcement learning from being applicable in the real world. In this work, we tackle this gap for the specific case of camera-based navigation, formulating it as following a visual cue in the foreground with arbitrary backgrounds. The visual cue in the foreground can often be simulated realistically, such as a line, gate or cone. The challenge then lies in coping with the unknown backgrounds and integrating both. As such, the goal is to train a visual agent on data captured in an empty simulated environment except for this foreground cue and test this model directly in a visually diverse real world. In order to bridge this big gap, we show it's crucial to combine following techniques namely: Randomized augmentation of the fore- and background, regularization with both deep supervision and triplet loss and finally abstraction of the dynamics by using waypoints rather than direct velocity commands. The various techniques are ablated in our experimental results both qualitatively and quantitatively finally demonstrating a successful transfer from simulation to the real world.
    Specializing Versatile Skill Libraries using Local Mixture of Experts. (arXiv:2112.04216v2 [cs.LG] UPDATED)
    (2 min) A long-cherished vision in robotics is to equip robots with skills that match the versatility and precision of humans. For example, when playing table tennis, a robot should be capable of returning the ball in various ways while precisely placing it at the desired location. A common approach to model such versatile behavior is to use a Mixture of Experts (MoE) model, where each expert is a contextual motion primitive. However, learning such MoEs is challenging as most objectives force the model to cover the entire context space, which prevents specialization of the primitives resulting in rather low-quality components. Starting from maximum entropy reinforcement learning (RL), we decompose the objective into optimizing an individual lower bound per mixture component. Further, we introduce a curriculum by allowing the components to focus on a local context region, enabling the model to learn highly accurate skill representations. To this end, we use local context distributions that are adapted jointly with the expert primitives. Our lower bound advocates an iterative addition of new components, where new components will concentrate on local context regions not covered by the current MoE. This local and incremental learning results in a modular MoE model of high accuracy and versatility, where both properties can be scaled by adding more components on the fly. We demonstrate this by an extensive ablation and on two challenging simulated robot skill learning tasks. We compare our achieved performance to LaDiPS and HiREPS, a known hierarchical policy search method for learning diverse skills.
    Lifted Model Checking for Relational MDPs. (arXiv:2106.11735v2 [cs.LG] UPDATED)
    (0 min) Probabilistic model checking has been developed for verifying systems that have stochastic and nondeterministic behavior. Given a probabilistic system, a probabilistic model checker takes a property and checks whether or not the property holds in that system. For this reason, probabilistic model checking provide rigorous guarantees. So far, however, probabilistic model checking has focused on propositional models where a state is represented by a symbol. On the other hand, it is commonly required to make relational abstractions in planning and reinforcement learning. Various frameworks handle relational domains, for instance, STRIPS planning and relational Markov Decision Processes. Using propositional model checking in relational settings requires one to ground the model, which leads to the well known state explosion problem and intractability. We present pCTL-REBEL, a lifted model checking approach for verifying pCTL properties of relational MDPs. It extends REBEL, a relational model-based reinforcement learning technique, toward relational pCTL model checking. PCTL-REBEL is lifted, which means that rather than grounding, the model exploits symmetries to reason about a group of objects as a whole at the relational level. Theoretically, we show that pCTL model checking is decidable for relational MDPs that have a possibly infinite domain, provided that the states have a bounded size. Practically, we contribute algorithms and an implementation of lifted relational model checking, and we show that the lifted approach improves the scalability of the model checking approach.
    GECKO: Reconciling Privacy, Accuracy and Efficiency in Embedded Deep Learning. (arXiv:2010.00912v3 [cs.CR] UPDATED)
    (0 min) Embedded systems demand on-device processing of data using Neural Networks (NNs) while conforming to the memory, power and computation constraints, leading to an efficiency and accuracy tradeoff. To bring NNs to edge devices, several optimizations such as model compression through pruning, quantization, and off-the-shelf architectures with efficient design have been extensively adopted. These algorithms when deployed to real world sensitive applications, requires to resist inference attacks to protect privacy of users training data. However, resistance against inference attacks is not accounted for designing NN models for IoT. In this work, we analyse the three-dimensional privacy-accuracy-efficiency tradeoff in NNs for IoT devices and propose Gecko training methodology where we explicitly add resistance to private inferences as a design objective. We optimize the inference-time memory, computation, and power constraints of embedded devices as a criterion for designing NN architecture while also preserving privacy. We choose quantization as design choice for highly efficient and private models. This choice is driven by the observation that compressed models leak more information compared to baseline models while off-the-shelf efficient architectures indicate poor efficiency and privacy tradeoff. We show that models trained using Gecko methodology are comparable to prior defences against black-box membership attacks in terms of accuracy and privacy while providing efficiency.
    Introduction to Multi-Armed Bandits. (arXiv:1904.07272v7 [cs.LG] UPDATED)
    (2 min) Multi-armed bandits a simple but very powerful framework for algorithms that make decisions over time under uncertainty. An enormous body of work has accumulated over the years, covered in several books and surveys. This book provides a more introductory, textbook-like treatment of the subject. Each chapter tackles a particular line of work, providing a self-contained, teachable technical introduction and a brief review of the further developments; many of the chapters conclude with exercises. The book is structured as follows. The first four chapters are on IID rewards, from the basic model to impossibility results to Bayesian priors to Lipschitz rewards. The next three chapters cover adversarial rewards, from the full-feedback version to adversarial bandits to extensions with linear rewards and combinatorially structured actions. Chapter 8 is on contextual bandits, a middle ground between IID and adversarial bandits in which the change in reward distributions is completely explained by observable contexts. The last three chapters cover connections to economics, from learning in repeated games to bandits with supply/budget constraints to exploration in the presence of incentives. The appendix provides sufficient background on concentration and KL-divergence. The chapters on "bandits with similarity information", "bandits with knapsacks" and "bandits and agents" can also be consumed as standalone surveys on the respective topics.
    Glance and Focus Networks for Dynamic Visual Recognition. (arXiv:2201.03014v1 [cs.CV])
    (2 min) Spatial redundancy widely exists in visual recognition tasks, i.e., discriminative features in an image or video frame usually correspond to only a subset of pixels, while the remaining regions are irrelevant to the task at hand. Therefore, static models which process all the pixels with an equal amount of computation result in considerable redundancy in terms of time and space consumption. In this paper, we formulate the image recognition problem as a sequential coarse-to-fine feature learning process, mimicking the human visual system. Specifically, the proposed Glance and Focus Network (GFNet) first extracts a quick global representation of the input image at a low resolution scale, and then strategically attends to a series of salient (small) regions to learn finer features. The sequential process naturally facilitates adaptive inference at test time, as it can be terminated once the model is sufficiently confident about its prediction, avoiding further redundant computation. It is worth noting that the problem of locating discriminant regions in our model is formulated as a reinforcement learning task, thus requiring no additional manual annotations other than classification labels. GFNet is general and flexible as it is compatible with any off-the-shelf backbone models (such as MobileNets, EfficientNets and TSM), which can be conveniently deployed as the feature extractor. Extensive experiments on a variety of image classification and video recognition tasks and with various backbone models demonstrate the remarkable efficiency of our method. For example, it reduces the average latency of the highly efficient MobileNet-V3 on an iPhone XS Max by 1.3x without sacrificing accuracy. Code and pre-trained models are available at https://github.com/blackfeather-wang/GFNet-Pytorch.
    AGMI: Attention-Guided Multi-omics Integration for Drug Response Prediction with Graph Neural Networks. (arXiv:2112.08366v2 [q-bio.GN] UPDATED)
    (2 min) Accurate drug response prediction (DRP) is a crucial yet challenging task in precision medicine. This paper presents a novel Attention-Guided Multi-omics Integration (AGMI) approach for DRP, which first constructs a Multi-edge Graph (MeG) for each cell line, and then aggregates multi-omics features to predict drug response using a novel structure, called Graph edge-aware Network (GeNet). For the first time, our AGMI approach explores gene constraint based multi-omics integration for DRP with the whole-genome using GNNs. Empirical experiments on the CCLE and GDSC datasets show that our AGMI largely outperforms state-of-the-art DRP methods by 8.3%--34.2% on four metrics. Our data and code are available at https://github.com/yivan-WYYGDSG/AGMI.
    Fair and efficient contribution valuation for vertical federated learning. (arXiv:2201.02658v1 [cs.LG])
    (2 min) Federated learning is a popular technology for training machine learning models on distributed data sources without sharing data. Vertical federated learning or feature-based federated learning applies to the cases that different data sources share the same sample ID space but differ in feature space. To ensure the data owners' long-term engagement, it is critical to objectively assess the contribution from each data source and recompense them accordingly. The Shapley value (SV) is a provably fair contribution valuation metric originated from cooperative game theory. However, computing the SV requires extensively retraining the model on each subset of data sources, which causes prohibitively high communication costs in federated learning. We propose a contribution valuation metric called vertical federated Shapley value (VerFedSV) based on SV. We show that VerFedSV not only satisfies many desirable properties for fairness but is also efficient to compute, and can be adapted to both synchronous and asynchronous vertical federated learning algorithms. Both theoretical analysis and extensive experimental results verify the fairness, efficiency, and adaptability of VerFedSV.
    Data-Efficient Information Extraction from Form-Like Documents. (arXiv:2201.02647v1 [cs.LG])
    (2 min) Automating information extraction from form-like documents at scale is a pressing need due to its potential impact on automating business workflows across many industries like financial services, insurance, and healthcare. The key challenge is that form-like documents in these business workflows can be laid out in virtually infinitely many ways; hence, a good solution to this problem should generalize to documents with unseen layouts and languages. A solution to this problem requires a holistic understanding of both the textual segments and the visual cues within a document, which is non-trivial. While the natural language processing and computer vision communities are starting to tackle this problem, there has not been much focus on (1) data-efficiency, and (2) ability to generalize across different document types and languages. In this paper, we show that when we have only a small number of labeled documents for training (~50), a straightforward transfer learning approach from a considerably structurally-different larger labeled corpus yields up to a 27 F1 point improvement over simply training on the small corpus in the target domain. We improve on this with a simple multi-domain transfer learning approach, that is currently in production use, and show that this yields up to a further 8 F1 point improvement. We make the case that data efficiency is critical to enable information extraction systems to scale to handle hundreds of different document-types, and learning good representations is critical to accomplishing this.
    Hyperspectral Image Denoising Using Non-convex Local Low-rank and Sparse Separation with Spatial-Spectral Total Variation Regularization. (arXiv:2201.02812v1 [eess.IV])
    (2 min) In this paper, we propose a novel nonconvex approach to robust principal component analysis for HSI denoising, which focuses on simultaneously developing more accurate approximations to both rank and column-wise sparsity for the low-rank and sparse components, respectively. In particular, the new method adopts the log-determinant rank approximation and a novel $\ell_{2,\log}$ norm, to restrict the local low-rank or column-wisely sparse properties for the component matrices, respectively. For the $\ell_{2,\log}$-regularized shrinkage problem, we develop an efficient, closed-form solution, which is named $\ell_{2,\log}$-shrinkage operator. The new regularization and the corresponding operator can be generally used in other problems that require column-wise sparsity. Moreover, we impose the spatial-spectral total variation regularization in the log-based nonconvex RPCA model, which enhances the global piece-wise smoothness and spectral consistency from the spatial and spectral views in the recovered HSI. Extensive experiments on both simulated and real HSIs demonstrate the effectiveness of the proposed method in denoising HSIs.
    Weak Supervision for Affordable Modeling of Electrocardiogram Data. (arXiv:2201.02936v1 [eess.SP])
    (2 min) Analysing electrocardiograms (ECGs) is an inexpensive and non-invasive, yet powerful way to diagnose heart disease. ECG studies using Machine Learning to automatically detect abnormal heartbeats so far depend on large, manually annotated datasets. While collecting vast amounts of unlabeled data can be straightforward, the point-by-point annotation of abnormal heartbeats is tedious and expensive. We explore the use of multiple weak supervision sources to learn diagnostic models of abnormal heartbeats via human designed heuristics, without using ground truth labels on individual data points. Our work is among the first to define weak supervision sources directly on time series data. Results show that with as few as six intuitive time series heuristics, we are able to infer high quality probabilistic label estimates for over 100,000 heartbeats with little human effort, and use the estimated labels to train competitive classifiers evaluated on held out test data.
    Attention-based Random Forest and Contamination Model. (arXiv:2201.02880v1 [cs.LG])
    (2 min) A new approach called ABRF (the attention-based random forest) and its modifications for applying the attention mechanism to the random forest (RF) for regression and classification are proposed. The main idea behind the proposed ABRF models is to assign attention weights with trainable parameters to decision trees in a specific way. The weights depend on the distance between an instance, which falls into a corresponding leaf of a tree, and instances, which fall in the same leaf. This idea stems from representation of the Nadaraya-Watson kernel regression in the form of a RF. Three modifications of the general approach are proposed. The first one is based on applying the Huber's contamination model and on computing the attention weights by solving quadratic or linear optimization problems. The second and the third modifications use the gradient-based algorithms for computing trainable parameters. Numerical experiments with various regression and classification datasets illustrate the proposed method.
    Posture Prediction for Healthy Sitting using a Smart Chair. (arXiv:2201.02615v1 [cs.LG])
    (2 min) Poor sitting habits have been identified as a risk factor to musculoskeletal disorders and lower back pain especially on the elderly, disabled people, and office workers. In the current computerized world, even while involved in leisure or work activity, people tend to spend most of their days sitting at computer desks. This can result in spinal pain and related problems. Therefore, a means to remind people about their sitting habits and provide recommendations to counterbalance, such as physical exercise, is important. Posture recognition for seated postures have not received enough attention as most works focus on standing postures. Wearable sensors, pressure or force sensors, videos and images were used for posture recognition in the literature. The aim of this study is to build Machine Learning models for classifying sitting posture of a person by analyzing data collected from a chair platted with two 32 by 32 pressure sensors at its seat and backrest. Models were built using five algorithms: Random Forest (RF), Gaussian Na\"ive Bayes, Logistic Regression, Support Vector Machine and Deep Neural Network (DNN). All the models are evaluated using KFold cross-validation technique. This paper presents experiments conducted using the two separate datasets, controlled and realistic, and discusses results achieved at classifying six sitting postures. Average classification accuracies of 98% and 97% were achieved on the controlled and realistic datasets, respectively.
    Clustering Text Using Attention. (arXiv:2201.02816v1 [cs.CL])
    (2 min) Clustering Text has been an important problem in the domain of Natural Language Processing. While there are techniques to cluster text based on using conventional clustering techniques on top of contextual or non-contextual vector space representations, it still remains a prevalent area of research possible to various improvements in performance and implementation of these techniques. This paper discusses a novel technique to cluster text using attention mechanisms. Attention Mechanisms have proven to be highly effective in various NLP tasks in recent times. This paper extends the idea of attention mechanism in clustering space and sheds some light on a whole new area of research
    Causal Discovery from Sparse Time-Series Data Using Echo State Network. (arXiv:2201.02933v1 [cs.LG])
    (2 min) Causal discovery between collections of time-series data can help diagnose causes of symptoms and hopefully prevent faults before they occur. However, reliable causal discovery can be very challenging, especially when the data acquisition rate varies (i.e., non-uniform data sampling), or in the presence of missing data points (e.g., sparse data sampling). To address these issues, we proposed a new system comprised of two parts, the first part fills missing data with a Gaussian Process Regression, and the second part leverages an Echo State Network, which is a type of reservoir computer (i.e., used for chaotic system modeling) for Causal discovery. We evaluate the performance of our proposed system against three other off-the-shelf causal discovery algorithms, namely, structural expectation-maximization, sub-sampled linear auto-regression absolute coefficients, and multivariate Granger Causality with vector auto-regressive using the Tennessee Eastman chemical dataset; we report on their corresponding Matthews Correlation Coefficient(MCC) and Receiver Operating Characteristic curves (ROC) and show that the proposed system outperforms existing algorithms, demonstrating the viability of our approach to discover causal relationships in a complex system with missing entries.
    Reconfigurable Intelligent Surface Enabled Spatial Multiplexing with Fully Convolutional Network. (arXiv:2201.02834v1 [eess.SP])
    (2 min) Reconfigurable intelligent surface (RIS) is an emerging technology for future wireless communication systems. In this work, we consider downlink spatial multiplexing enabled by the RIS for weighted sum-rate (WSR) maximization. In the literature, most solutions use alternating gradient-based optimization, which has moderate performance, high complexity, and limited scalability. We propose to apply a fully convolutional network (FCN) to solve this problem, which was originally designed for semantic segmentation of images. The rectangular shape of the RIS and the spatial correlation of channels with adjacent RIS antennas due to the short distance between them encourage us to apply it for the RIS configuration. We design a set of channel features that includes both cascaded channels via the RIS and the direct channel. In the base station (BS), the differentiable minimum mean squared error (MMSE) precoder is used for pretraining and the weighted minimum mean squared error (WMMSE) precoder is then applied for fine-tuning, which is nondifferentiable, more complex, but achieves a better performance. Evaluation results show that the proposed solution has higher performance and allows for a faster evaluation than the baselines. Hence it scales better to a large number of antennas, advancing the RIS one step closer to practical deployment.
    Cognitive Computing to Optimize IT Services. (arXiv:2201.02737v1 [cs.CL])
    (2 min) In this paper, the challenges of maintaining a healthy IT operational environment have been addressed by proactively analyzing IT Service Desk tickets, customer satisfaction surveys, and social media data. A Cognitive solution goes beyond the traditional structured data analysis by deep analyses of both structured and unstructured text. The salient features of the proposed platform include language identification, translation, hierarchical extraction of the most frequently occurring topics, entities and their relationships, text summarization, sentiments, and knowledge extraction from the unstructured text using Natural Language Processing techniques. Moreover, the insights from unstructured text combined with structured data allow the development of various classification, segmentation, and time-series forecasting use-cases on the incident, problem, and change datasets. Further, the text and predictive insights together with raw data are used for visualization and exploration of actionable insights on a rich and interactive dashboard. However, it is hard not only to find these insights using traditional structured data analysis but it might also take a very long time to discover them, especially while dealing with a massive amount of unstructured data. By taking action on these insights, organizations can benefit from a significant reduction of ticket volume, reduced operational costs, and increased customer satisfaction. In various experiments, on average, upto 18-25% of yearly ticket volume has been reduced using the proposed approach.
    A Deep Learning Approach to Integrate Human-Level Understanding in a Chatbot. (arXiv:2201.02735v1 [cs.CL])
    (2 min) In recent times, a large number of people have been involved in establishing their own businesses. Unlike humans, chatbots can serve multiple customers at a time, are available 24/7 and reply in less than a fraction of a second. Though chatbots perform well in task-oriented activities, in most cases they fail to understand personalized opinions, statements or even queries which later impact the organization for poor service management. Lack of understanding capabilities in bots disinterest humans to continue conversations with them. Usually, chatbots give absurd responses when they are unable to interpret a user's text accurately. Extracting the client reviews from conversations by using chatbots, organizations can reduce the major gap of understanding between the users and the chatbot and improve their quality of products and services.Thus, in our research we incorporated all the key elements that are necessary for a chatbot to analyse and understand an input text precisely and accurately. We performed sentiment analysis, emotion detection, intent classification and named-entity recognition using deep learning to develop chatbots with humanistic understanding and intelligence. The efficiency of our approach can be demonstrated accordingly by the detailed analysis.
    Improved Input Reprogramming for GAN Conditioning. (arXiv:2201.02692v1 [cs.LG])
    (2 min) We study the GAN conditioning problem, whose goal is to convert a pretrained unconditional GAN into a conditional GAN using labeled data. We first identify and analyze three approaches to this problem -- conditional GAN training from scratch, fine-tuning, and input reprogramming. Our analysis reveals that when the amount of labeled data is small, input reprogramming performs the best. Motivated by real-world scenarios with scarce labeled data, we focus on the input reprogramming approach and carefully analyze the existing algorithm. After identifying a few critical issues of the previous input reprogramming approach, we propose a new algorithm called InRep+. Our algorithm InRep+ addresses the existing issues with the novel uses of invertible neural networks and Positive-Unlabeled (PU) learning. Via extensive experiments, we show that InRep+ outperforms all existing methods, particularly when label information is scarce, noisy, and/or imbalanced. For instance, for the task of conditioning a CIFAR10 GAN with 1% labeled data, InRep+ achieves an average Intra-FID of 82.13, whereas the second-best method achieves 114.51.
    Block Walsh-Hadamard Transform Based Binary Layers in Deep Neural Networks. (arXiv:2201.02711v1 [cs.LG])
    (2 min) Convolution has been the core operation of modern deep neural networks. It is well-known that convolutions can be implemented in the Fourier Transform domain. In this paper, we propose to use binary block Walsh-Hadamard transform (WHT) instead of the Fourier transform. We use WHT-based binary layers to replace some of the regular convolution layers in deep neural networks. We utilize both one-dimensional (1-D) and two-dimensional (2-D) binary WHTs in this paper. In both 1-D and 2-D layers, we compute the binary WHT of the input feature map and denoise the WHT domain coefficients using a nonlinearity which is obtained by combining soft-thresholding with the tanh function. After denoising, we compute the inverse WHT. We use 1D-WHT to replace the $1\times 1$ convolutional layers, and 2D-WHT layers can replace the 3$\times$3 convolution layers and Squeeze-and-Excite layers. 2D-WHT layers with trainable weights can be also inserted before the Global Average Pooling (GAP) layers to assist the dense layers. In this way, we can reduce the number of trainable parameters significantly with a slight decrease in trainable parameters. In this paper, we implement the WHT layers into MobileNet-V2, MobileNet-V3-Large, and ResNet to reduce the number of parameters significantly with negligible accuracy loss. Moreover, according to our speed test, the 2D-FWHT layer runs about 24 times as fast as the regular $3\times 3$ convolution with 19.51\% less RAM usage in an NVIDIA Jetson Nano experiment.
    Assessing Policy, Loss and Planning Combinations in Reinforcement Learning using a New Modular Architecture. (arXiv:2201.02874v1 [cs.LG])
    (2 min) The model-based reinforcement learning paradigm, which uses planning algorithms and neural network models, has recently achieved unprecedented results in diverse applications, leading to what is now known as deep reinforcement learning. These agents are quite complex and involve multiple components, factors that can create challenges for research. In this work, we propose a new modular software architecture suited for these types of agents, and a set of building blocks that can be easily reused and assembled to construct new model-based reinforcement learning agents. These building blocks include planning algorithms, policies, and loss functions. We illustrate the use of this architecture by combining several of these building blocks to implement and test agents that are optimized to three different test environments: Cartpole, Minigrid, and Tictactoe. One particular planning algorithm, made available in our implementation and not previously used in reinforcement learning, which we called averaged minimax, achieved good results in the three tested environments. Experiments performed with this architecture have shown that the best combination of planning algorithm, policy, and loss function is heavily problem dependent. This result provides evidence that the proposed architecture, which is modular and reusable, is useful for reinforcement learning researchers who want to study new environments and techniques.
    Bitcoin Price Predictive Modeling Using Expert Correction. (arXiv:2201.02729v1 [q-fin.ST])
    (2 min) The paper studies the linear model for Bitcoin price which includes regression features based on Bitcoin currency statistics, mining processes, Google search trends, Wikipedia pages visits. The pattern of deviation of regression model prediction from real prices is simpler comparing to price time series. It is assumed that this pattern can be predicted by an experienced expert. In such a way, using the combination of the regression model and expert correction, one can receive better results than with either regression model or expert opinion only. It is shown that Bayesian approach makes it possible to utilize the probabilistic approach using distributions with fat tails and take into account the outliers in Bitcoin price time series.
    Open-Set Recognition of Breast Cancer Treatments. (arXiv:2201.02923v1 [cs.LG])
    (2 min) Open-set recognition generalizes a classification task by classifying test samples as one of the known classes from training or "unknown." As novel cancer drug cocktails with improved treatment are continually discovered, predicting cancer treatments can naturally be formulated in terms of an open-set recognition problem. Drawbacks, due to modeling unknown samples during training, arise from straightforward implementations of prior work in healthcare open-set learning. Accordingly, we reframe the problem methodology and apply a recent existing Gaussian mixture variational autoencoder model, which achieves state-of-the-art results for image datasets, to breast cancer patient data. Not only do we obtain more accurate and robust classification results, with a 24.5% average F1 increase compared to a recent method, but we also reexamine open-set recognition in terms of deployability to a clinical setting.
    Spatio-Temporal Graph Representation Learning for Fraudster Group Detection. (arXiv:2201.02621v1 [cs.LG])
    (2 min) Motivated by potential financial gain, companies may hire fraudster groups to write fake reviews to either demote competitors or promote their own businesses. Such groups are considerably more successful in misleading customers, as people are more likely to be influenced by the opinion of a large group. To detect such groups, a common model is to represent fraudster groups' static networks, consequently overlooking the longitudinal behavior of a reviewer thus the dynamics of co-review relations among reviewers in a group. Hence, these approaches are incapable of excluding outlier reviewers, which are fraudsters intentionally camouflaging themselves in a group and genuine reviewers happen to co-review in fraudster groups. To address this issue, in this work, we propose to first capitalize on the effectiveness of the HIN-RNN in both reviewers' representation learning while capturing the collaboration between reviewers, we first utilize the HIN-RNN to model the co-review relations of reviewers in a group in a fixed time window of 28 days. We refer to this as spatial relation learning representation to signify the generalisability of this work to other networked scenarios. Then we use an RNN on the spatial relations to predict the spatio-temporal relations of reviewers in the group. In the third step, a Graph Convolution Network (GCN) refines the reviewers' vector representations using these predicted relations. These refined representations are then used to remove outlier reviewers. The average of the remaining reviewers' representation is then fed to a simple fully connected layer to predict if the group is a fraudster group or not. Exhaustive experiments of the proposed approach showed a 5% (4%), 12% (5%), 12% (5%) improvement over three of the most recent approaches on precision, recall, and F1-value over the Yelp (Amazon) dataset, respectively.
    Optimal 1-Wasserstein Distance for WGANs. (arXiv:2201.02824v1 [stat.ML])
    (2 min) The mathematical forces at work behind Generative Adversarial Networks raise challenging theoretical issues. Motivated by the important question of characterizing the geometrical properties of the generated distributions, we provide a thorough analysis of Wasserstein GANs (WGANs) in both the finite sample and asymptotic regimes. We study the specific case where the latent space is univariate and derive results valid regardless of the dimension of the output space. We show in particular that for a fixed sample size, the optimal WGANs are closely linked with connected paths minimizing the sum of the squared Euclidean distances between the sample points. We also highlight the fact that WGANs are able to approach (for the 1-Wasserstein distance) the target distribution as the sample size tends to infinity, at a given convergence rate and provided the family of generative Lipschitz functions grows appropriately. We derive in passing new results on optimal transport theory in the semi-discrete setting.
    Attacking Vertical Collaborative Learning System Using Adversarial Dominating Inputs. (arXiv:2201.02775v1 [cs.CR])
    (2 min) Vertical collaborative learning system also known as vertical federated learning (VFL) system has recently become prominent as a concept to process data distributed across many individual sources without the need to centralize it. Multiple participants collaboratively train models based on their local data in a privacy-preserving manner. To date, VFL has become a de facto solution to securely learn a model among organizations, allowing knowledge to be shared without compromising privacy of any individual organizations. Despite the prosperous development of VFL systems, we find that certain inputs of a participant, named adversarial dominating inputs (ADIs), can dominate the joint inference towards the direction of the adversary's will and force other (victim) participants to make negligible contributions, losing rewards that are usually offered regarding the importance of their contributions in collaborative learning scenarios. We conduct a systematic study on ADIs by first proving their existence in typical VFL systems. We then propose gradient-based methods to synthesize ADIs of various formats and exploit common VFL systems. We further launch greybox fuzz testing, guided by the resiliency score of "victim" participants, to perturb adversary-controlled inputs and systematically explore the VFL attack surface in a privacy-preserving manner. We conduct an in-depth study on the influence of critical parameters and settings in synthesizing ADIs. Our study reveals new VFL attack opportunities, promoting the identification of unknown threats before breaches and building more secure VFL systems.
    PocketNN: Integer-only Training and Inference of Neural Networks via Direct Feedback Alignment and Pocket Activations in Pure C++. (arXiv:2201.02863v1 [cs.LG])
    (2 min) Standard deep learning algorithms are implemented using floating-point real numbers. This presents an obstacle for implementing them on low-end devices which may not have dedicated floating-point units (FPUs). As a result, researchers in TinyML have considered machine learning algorithms that can train and run a deep neural network (DNN) on a low-end device using integer operations only. In this paper we propose PocketNN, a light and self-contained proof-of-concept framework in pure C++ for the training and inference of DNNs using only integers. Unlike other approaches, PocketNN directly operates on integers without requiring any explicit quantization algorithms or customized fixed-point formats. This was made possible by pocket activations, which are a family of activation functions devised for integer-only DNNs, and an emerging DNN training algorithm called direct feedback alignment (DFA). Unlike the standard backpropagation (BP), DFA trains each layer independently, thus avoiding integer overflow which is a key problem when using BP with integer-only operations. We used PocketNN to train some DNNs on two well-known datasets, MNIST and Fashion-MNIST. Our experiments show that the DNNs trained with our PocketNN achieved 96.98% and 87.7% accuracies on MNIST and Fashion-MNIST datasets, respectively. The accuracies are very close to the equivalent DNNs trained using BP with floating-point real number operations, such that accuracy degradations were just 1.02%p and 2.09%p, respectively. Finally, our PocketNN has high compatibility and portability for low-end devices as it is open source and implemented in pure C++ without any dependencies.
    Optimizing the Communication-Accuracy Trade-off in Federated Learning with Rate-Distortion Theory. (arXiv:2201.02664v1 [cs.LG])
    (2 min) A significant bottleneck in federated learning is the network communication cost of sending model updates from client devices to the central server. We propose a method to reduce this cost. Our method encodes quantized updates with an appropriate universal code, taking into account their empirical distribution. Because quantization introduces error, we select quantization levels by optimizing for the desired trade-off in average total bitrate and gradient distortion. We demonstrate empirically that in spite of the non-i.i.d. nature of federated learning, the rate-distortion frontier is consistent across datasets, optimizers, clients and training rounds, and within each setting, distortion reliably predicts model performance. This allows for a remarkably simple compression scheme that is near-optimal in many use cases, and outperforms Top-K, DRIVE, 3LC and QSGD on the Stack Overflow next-word prediction benchmark.
    Deep Generative Modeling for Volume Reconstruction in Cryo-Electron Microscop. (arXiv:2201.02867v1 [eess.IV])
    (2 min) Recent breakthroughs in high resolution imaging of biomolecules in solution with cryo-electron microscopy (cryo-EM) have unlocked new doors for the reconstruction of molecular volumes, thereby promising further advances in biology, chemistry, and pharmacological research amongst others. Despite significant headway, the immense challenges in cryo-EM data analysis remain legion and intricately inter-disciplinary in nature, requiring insights from physicists, structural biologists, computer scientists, statisticians, and applied mathematicians. Meanwhile, recent next-generation volume reconstruction algorithms that combine generative modeling with end-to-end unsupervised deep learning techniques have shown promising results on simulated data, but still face considerable hurdles when applied to experimental cryo-EM images. In light of the proliferation of such methods and given the interdisciplinary nature of the task, we propose here a critical review of recent advances in the field of deep generative modeling for high resolution cryo-EM volume reconstruction. The present review aims to (i) compare and contrast these new methods, while (ii) presenting them from a perspective and using terminology familiar to scientists in each of the five aforementioned fields with no specific background in cryo-EM. The review begins with an introduction to the mathematical and computational challenges of deep generative models for cryo-EM volume reconstruction, along with an overview of the baseline methodology shared across this class of algorithms. Having established the common thread weaving through these different models, we provide a practical comparison of these state-of-the-art algorithms, highlighting their relative strengths and weaknesses, along with the assumptions that they rely on. This allows us to identify bottlenecks in current methods and avenues for future research.
    Stay Positive: Knowledge Graph Embedding Without Negative Sampling. (arXiv:2201.02661v1 [cs.LG])
    (2 min) Knowledge graphs (KGs) are typically incomplete and we often wish to infer new facts given the existing ones. This can be thought of as a binary classification problem; we aim to predict if new facts are true or false. Unfortunately, we generally only have positive examples (the known facts) but we also need negative ones to train a classifier. To resolve this, it is usual to generate negative examples using a negative sampling strategy. However, this can produce false negatives which may reduce performance, is computationally expensive, and does not produce calibrated classification probabilities. In this paper, we propose a training procedure that obviates the need for negative sampling by adding a novel regularization term to the loss function. Our results for two relational embedding models (DistMult and SimplE) show the merit of our proposal both in terms of performance and speed.
    Testing the Robustness of a BiLSTM-based Structural Story Classifier. (arXiv:2201.02733v1 [cs.CL])
    (2 min) The growing prevalence of counterfeit stories on the internet has fostered significant interest towards fast and scalable detection of fake news in the machine learning community. While several machine learning techniques for this purpose have emerged, we observe that there is a need to evaluate the impact of noise on these techniques' performance, where noise constitutes news articles being mistakenly labeled as fake (or real). This work takes a step in that direction, where we examine the impact of noise on a state-of-the-art, structural model based on BiLSTM (Bidirectional Long-Short Term Model) for fake news detection, Hierarchical Discourse-level Structure for Fake News Detection by Karimi and Tang (Reference no. 9).
    Extraction of Product Specifications from the Web -- Going Beyond Tables and Lists. (arXiv:2201.02896v1 [cs.IR])
    (2 min) E-commerce product pages on the web often present product specification data in structured tabular blocks. Extraction of these product attribute-value specifications has benefited applications like product catalogue curation, search, question answering, and others. However, across different Websites, there is a wide variety of HTML elements (like , , , , etc.) typically used to render these blocks that makes their automatic extraction a challenge. Most of the current research has focused on extracting product specifications from tables and lists and, therefore, suffers from recall when applied to a large-scale extraction setting. In this paper, we present a product specification extraction approach that goes beyond tables or lists and generalizes across the diverse HTML elements used for rendering specification blocks. Using a combination of hand-coded features and deep learned spatial and token features, we first identify the specification blocks on a product page. We then extract the product attribute-value pairs from these blocks following an approach inspired by wrapper induction. We created a labeled dataset of product specifications extracted from 14,111 diverse specification blocks taken from a range of different product websites. Our experiments show the efficacy of our approach compared to the current specification extraction models and support our claim about its application to large-scale product specification extraction.
    An Adaptive Device-Edge Co-Inference Framework Based on Soft Actor-Critic. (arXiv:2201.02968v1 [cs.LG])
    (2 min) Recently, the applications of deep neural network (DNN) have been very prominent in many fields such as computer vision (CV) and natural language processing (NLP) due to its superior feature extraction performance. However, the high-dimension parameter model and large-scale mathematical calculation restrict the execution efficiency, especially for Internet of Things (IoT) devices. Different from the previous cloud/edge-only pattern that brings huge pressure for uplink communication and device-only fashion that undertakes unaffordable calculation strength, we highlight the collaborative computation between the device and edge for DNN models, which can achieve a good balance between the communication load and execution accuracy. Specifically, a systematic on-demand co-inference framework is proposed to exploit the multi-branch structure, in which the pre-trained Alexnet is right-sized through \emph{early-exit} and partitioned at an intermediate DNN layer. The integer quantization is enforced to further compress transmission bits. As a result, we establish a new Deep Reinforcement Learning (DRL) optimizer-Soft Actor Critic for discrete (SAC-d), which generates the \emph{exit point}, \emph{partition point}, and \emph{compressing bits} by soft policy iterations. Based on the latency and accuracy aware reward design, such an optimizer can well adapt to the complex environment like dynamic wireless channel and arbitrary CPU processing, and is capable of supporting the 5G URLLC. Real-world experiment on Raspberry Pi 4 and PC shows the outperformance of the proposed solution.
    LoMar: A Local Defense Against Poisoning Attack on Federated Learning. (arXiv:2201.02873v1 [cs.LG])
    (2 min) Federated learning (FL) provides a high efficient decentralized machine learning framework, where the training data remains distributed at remote clients in a network. Though FL enables a privacy-preserving mobile edge computing framework using IoT devices, recent studies have shown that this approach is susceptible to poisoning attacks from the side of remote clients. To address the poisoning attacks on FL, we provide a \textit{two-phase} defense algorithm called {Lo}cal {Ma}licious Facto{r} (LoMar). In phase I, LoMar scores model updates from each remote client by measuring the relative distribution over their neighbors using a kernel density estimation method. In phase II, an optimal threshold is approximated to distinguish malicious and clean updates from a statistical perspective. Comprehensive experiments on four real-world datasets have been conducted, and the experimental results show that our defense strategy can effectively protect the FL system. {Specifically, the defense performance on Amazon dataset under a label-flipping attack indicates that, compared with FG+Krum, LoMar increases the target label testing accuracy from $96.0\%$ to $98.8\%$, and the overall averaged testing accuracy from $90.1\%$ to $97.0\%$.
    Global Convergence Analysis of Deep Linear Networks with A One-neuron Layer. (arXiv:2201.02761v1 [cs.LG])
    (2 min) In this paper, we follow Eftekhari's work to give a non-local convergence analysis of deep linear networks. Specifically, we consider optimizing deep linear networks which have a layer with one neuron under quadratic loss. We describe the convergent point of trajectories with arbitrary starting point under gradient flow, including the paths which converge to one of the saddle points or the original point. We also show specific convergence rates of trajectories that converge to the global minimizer by stages. To achieve these results, this paper mainly extends the machinery in Eftekhari's work to provably identify the rank-stable set and the global minimizer convergent set. We also give specific examples to show the necessity of our definitions. Crucially, as far as we know, our results appear to be the first to give a non-local global analysis of linear neural networks from arbitrary initialized points, rather than the lazy training regime which has dominated the literature of neural networks, and restricted benign initialization in Eftekhari's work. We also note that extending our results to general linear networks without one hidden neuron assumption remains a challenging open problem.
    BottleFit: Learning Compressed Representations in Deep Neural Networks for Effective and Efficient Split Computing. (arXiv:2201.02693v1 [cs.LG])
    (2 min) Although mission-critical applications require the use of deep neural networks (DNNs), their continuous execution at mobile devices results in a significant increase in energy consumption. While edge offloading can decrease energy consumption, erratic patterns in channel quality, network and edge server load can lead to severe disruption of the system's key operations. An alternative approach, called split computing, generates compressed representations within the model (called "bottlenecks"), to reduce bandwidth usage and energy consumption. Prior work has proposed approaches that introduce additional layers, to the detriment of energy consumption and latency. For this reason, we propose a new framework called BottleFit, which, in addition to targeted DNN architecture modifications, includes a novel training strategy to achieve high accuracy even with strong compression rates. We apply BottleFit on cutting-edge DNN models in image classification, and show that BottleFit achieves 77.1% data compression with up to 0.6% accuracy loss on ImageNet dataset, while state of the art such as SPINN loses up to 6% in accuracy. We experimentally measure the power consumption and latency of an image classification application running on an NVIDIA Jetson Nano board (GPU-based) and a Raspberry PI board (GPU-less). We show that BottleFit decreases power consumption and latency respectively by up to 49% and 89% with respect to (w.r.t.) local computing and by 37% and 55% w.r.t. edge offloading. We also compare BottleFit with state-of-the-art autoencoders-based approaches, and show that (i) BottleFit reduces power consumption and execution time respectively by up to 54% and 44% on the Jetson and 40% and 62% on Raspberry PI; (ii) the size of the head model executed on the mobile device is 83 times smaller. The code repository will be published for full reproducibility of the results.
    {\lambda}-Scaled-Attention: A Novel Fast Attention Mechanism for Efficient Modeling of Protein Sequences. (arXiv:2201.02912v1 [cs.LG])
    (2 min) Attention-based deep networks have been successfully applied on textual data in the field of NLP. However, their application on protein sequences poses additional challenges due to the weak semantics of the protein words, unlike the plain text words. These unexplored challenges faced by the standard attention technique include (i) vanishing attention score problem and (ii) high variations in the attention distribution. In this regard, we introduce a novel {\lambda}-scaled attention technique for fast and efficient modeling of the protein sequences that addresses both the above problems. This is used to develop the {\lambda}-scaled attention network and is evaluated for the task of protein function prediction implemented at the protein sub-sequence level. Experiments on the datasets for biological process (BP) and molecular function (MF) showed significant improvements in the F1 score values for the proposed {\lambda}-scaled attention technique over its counterpart approach based on the standard attention technique (+2.01% for BP and +4.67% for MF) and state-of-the-art ProtVecGen-Plus approach (+2.61% for BP and +4.20% for MF). Further, fast convergence (converging in half the number of epochs) and efficient learning (in terms of very low difference between the training and validation losses) were also observed during the training process.
    Lazy Lagrangians with Predictions for Online Learning. (arXiv:2201.02890v1 [cs.LG])
    (2 min) We consider the general problem of online convex optimization with time-varying additive constraints in the presence of predictions for the next cost and constraint functions. A novel primal-dual algorithm is designed by combining a Follow-The-Regularized-Leader iteration with prediction-adaptive dynamic steps. The algorithm achieves $\mathcal O(T^{\frac{3-\beta}{4}})$ regret and $\mathcal O(T^{\frac{1+\beta}{2}})$ constraint violation bounds that are tunable via parameter $\beta\!\in\![1/2,1)$ and have constant factors that shrink with the predictions quality, achieving eventually $\mathcal O(1)$ regret for perfect predictions. Our work extends the FTRL framework for this constrained OCO setting and outperforms the respective state-of-the-art greedy-based solutions, without imposing conditions on the quality of predictions, the cost functions or the geometry of constraints, beyond convexity.
    A Multi-agent Reinforcement Learning Approach for Efficient Client Selection in Federated Learning. (arXiv:2201.02932v1 [cs.LG])
    (2 min) Federated learning (FL) is a training technique that enables client devices to jointly learn a shared model by aggregating locally-computed models without exposing their raw data. While most of the existing work focuses on improving the FL model accuracy, in this paper, we focus on the improving the training efficiency, which is often a hurdle for adopting FL in real-world applications. Specifically, we design an efficient FL framework which jointly optimizes model accuracy, processing latency and communication efficiency, all of which are primary design considerations for real implementation of FL. Inspired by the recent success of Multi-Agent Reinforcement Learning (MARL) in solving complex control problems, we present \textit{FedMarl}, an MARL-based FL framework which performs efficient run-time client selection. Experiments show that FedMarl can significantly improve model accuracy with much lower processing latency and communication cost.
    An Improved Mathematical Model of Sepsis: Modeling, Bifurcation Analysis, and Optimal Control Study for Complex Nonlinear Infectious Disease System. (arXiv:2201.02702v1 [math.DS])
    (2 min) Sepsis is a life-threatening medical emergency, which is a major cause of death worldwide and the second highest cause of mortality in the United States. Researching the optimal control treatment or intervention strategy on the comprehensive sepsis system is key in reducing mortality. For this purpose, first, this paper improves a complex nonlinear sepsis model proposed in our previous work. Then, bifurcation analyses are conducted for each sepsis subsystem to study the model behaviors under some system parameters. The bifurcation analysis results also further indicate the necessity of control treatment and intervention therapy. If the sepsis system is without adding any control under some parameter and initial system value settings, the system will perform persistent inflammation outcomes as time goes by. Therefore, we develop our complex improved nonlinear sepsis model into a sepsis optimal control model, and then use some effective biomarkers recommended in existing clinic practices as optimization objective function to measure the development of sepsis. Besides that, a Bayesian optimization algorithm by combining Recurrent neural network (RNN-BO algorithm) is introduced to predict the optimal control strategy for the studied sepsis optimal control system. The difference between the RNN-BO algorithm from other optimization algorithms is that once given any new initial system value setting (initial value is associated with the initial conditions of patients), the RNN-BO algorithm is capable of quickly predicting a corresponding time-series optimal control based on the historical optimal control data for any new sepsis patient. To demonstrate the effectiveness and efficiency of the RNN-BO algorithm on solving the optimal control solution on the complex nonlinear sepsis system, some numerical simulations are implemented by comparing with other optimization algorithms in this paper.
    Unsupervised Machine Learning for Exploratory Data Analysis of Exoplanet Transmission Spectra. (arXiv:2201.02696v1 [astro-ph.EP])
    (2 min) Transit spectroscopy is a powerful tool to decode the chemical composition of the atmospheres of extrasolar planets. In this paper we focus on unsupervised techniques for analyzing spectral data from transiting exoplanets. We demonstrate methods for i) cleaning and validating the data, ii) initial exploratory data analysis based on summary statistics (estimates of location and variability), iii) exploring and quantifying the existing correlations in the data, iv) pre-processing and linearly transforming the data to its principal components, v) dimensionality reduction and manifold learning, vi) clustering and anomaly detection, vii) visualization and interpretation of the data. To illustrate the proposed unsupervised methodology, we use a well-known public benchmark data set of synthetic transit spectra. We show that there is a high degree of correlation in the spectral data, which calls for appropriate low-dimensional representations. We explore a number of different techniques for such dimensionality reduction and identify several suitable options in terms of summary statistics, principal components, etc. We uncover interesting structures in the principal component basis, namely, well-defined branches corresponding to different chemical regimes of the underlying atmospheres. We demonstrate that those branches can be successfully recovered with a K-means clustering algorithm in fully unsupervised fashion. We advocate for a three-dimensional representation of the spectroscopic data in terms of the first three principal components, in order to reveal the existing structure in the data and quickly characterize the chemical class of a planet.
    A Sneak Attack on Segmentation of Medical Images Using Deep Neural Network Classifiers. (arXiv:2201.02771v1 [eess.IV])
    (2 min) Instead of using current deep-learning segmentation models (like the UNet and variants), we approach the segmentation problem using trained Convolutional Neural Network (CNN) classifiers, which automatically extract important features from classified targets for image classification. Those extracted features can be visualized and formed heatmaps using Gradient-weighted Class Activation Mapping (Grad-CAM). This study tested whether the heatmaps could be used to segment the classified targets. We also proposed an evaluation method for the heatmaps; that is, to re-train the CNN classifier using images filtered by heatmaps and examine its performance. We used the mean-Dice coefficient to evaluate segmentation results. Results from our experiments show that heatmaps can locate and segment partial tumor areas. But only use of the heatmaps from CNN classifiers may not be an optimal approach for segmentation. In addition, we have verified that the predictions of CNN classifiers mainly depend on tumor areas, and dark regions in Grad-CAM's heatmaps also contribute to classification.
    Provable Clustering of a Union of Linear Manifolds Using Optimal Directions. (arXiv:2201.02745v1 [cs.LG])
    (2 min) This paper focuses on the Matrix Factorization based Clustering (MFC) method which is one of the few closed form algorithms for the subspace clustering problem. Despite being simple, closed-form, and computation-efficient, MFC can outperform the other sophisticated subspace clustering methods in many challenging scenarios. We reveal the connection between MFC and the Innovation Pursuit (iPursuit) algorithm which was shown to be able to outperform the other spectral clustering based methods with a notable margin especially when the span of clusters are close. A novel theoretical study is presented which sheds light on the key performance factors of both algorithms (MFC/iPursuit) and it is shown that both algorithms can be robust to notable intersections between the span of clusters. Importantly, in contrast to the theoretical guarantees of other algorithms which emphasized on the distance between the subspaces as the key performance factor and without making the innovation assumption, it is shown that the performance of MFC/iPursuit mainly depends on the distance between the innovative components of the clusters.
    DeHIN: A Decentralized Framework for Embedding Large-scale Heterogeneous Information Networks. (arXiv:2201.02757v1 [cs.LG])
    (2 min) Modeling heterogeneity by extraction and exploitation of high-order information from heterogeneous information networks (HINs) has been attracting immense research attention in recent times. Such heterogeneous network embedding (HNE) methods effectively harness the heterogeneity of small-scale HINs. However, in the real world, the size of HINs grow exponentially with the continuous introduction of new nodes and different types of links, making it a billion-scale network. Learning node embeddings on such HINs creates a performance bottleneck for existing HNE methods that are commonly centralized, i.e., complete data and the model are both on a single machine. To address large-scale HNE tasks with strong efficiency and effectiveness guarantee, we present \textit{Decentralized Embedding Framework for Heterogeneous Information Network} (DeHIN) in this paper. In DeHIN, we generate a distributed parallel pipeline that utilizes hypergraphs in order to infuse parallelization into the HNE task. DeHIN presents a context preserving partition mechanism that innovatively formulates a large HIN as a hypergraph, whose hyperedges connect semantically similar nodes. Our framework then adopts a decentralized strategy to efficiently partition HINs by adopting a tree-like pipeline. Then, each resulting subnetwork is assigned to a distributed worker, which employs the deep information maximization theorem to locally learn node embeddings from the partition it receives. We further devise a novel embedding alignment scheme to precisely project independently learned node embeddings from all subnetworks onto a common vector space, thus allowing for downstream tasks like link prediction and node classification.
    Neighbor2vec: an efficient and effective method for Graph Embedding. (arXiv:2201.02626v1 [cs.SI])
    (2 min) Graph embedding techniques have led to significant progress in recent years. However, present techniques are not effective enough to capture the patterns of networks. This paper propose neighbor2vec, a neighbor-based sampling strategy used algorithm to learn the neighborhood representations of node, a framework to gather the structure information by feature propagation between the node and its neighbors. We claim that neighbor2vec is a simple and effective approach to enhancing the scalability as well as equality of graph embedding, and it breaks the limits of the existing state-of-the-art unsupervised techniques. We conduct experiments on several node classification and link prediction tasks for networks such as ogbn-arxiv, ogbn-products, ogbn-proteins, ogbl-ppa,ogbl-collab and ogbl-citation2. The result shows that Neighbor2vec's representations provide an average accuracy scores up to 6.8 percent higher than competing methods in node classification tasks and 3.0 percent higher in link prediction tasks. The neighbor2vec's representations are able to outperform all baseline methods and two classical GNN models in all six experiments.
    Attention Option-Critic. (arXiv:2201.02628v1 [cs.LG])
    (2 min) Temporal abstraction in reinforcement learning is the ability of an agent to learn and use high-level behaviors, called options. The option-critic architecture provides a gradient-based end-to-end learning method to construct options. We propose an attention-based extension to this framework, which enables the agent to learn to focus different options on different aspects of the observation space. We show that this leads to behaviorally diverse options which are also capable of state abstraction, and prevents the degeneracy problems of option domination and frequent option switching that occur in option-critic, while achieving a similar sample complexity. We also demonstrate the more efficient, interpretable, and reusable nature of the learned options in comparison with option-critic, through different transfer learning tasks. Experimental results in a relatively simple four-rooms environment and the more complex ALE (Arcade Learning Environment) showcase the efficacy of our approach.
    Machine Learning-Based Disease Diagnosis:A Bibliometric Analysis. (arXiv:2201.02755v1 [cs.LG])
    (2 min) Machine Learning (ML) has garnered considerable attention from researchers and practitioners as a new and adaptable tool for disease diagnosis. With the advancement of ML and the proliferation of papers and research in this field, a complete examination of Machine Learning-Based Disease Diagnosis (MLBDD) is required. From a bibliometrics standpoint, this article comprehensively studies MLBDD papers from 2012 to 2021. Consequently, with particular keywords, 1710 papers with associate information have been extracted from the Scopus and Web of Science (WOS) database and integrated into the excel datasheet for further analysis. First, we examine the publication structures based on yearly publications and the most productive countries/regions, institutions, and authors. Second, the co-citation networks of countries/regions, institutions, authors, and articles are visualized using R-studio software. They are further examined in terms of citation structure and the most influential ones. This article gives an overview of MLBDD for researchers interested in the subject and conducts a thorough and complete study of MLBDD for those interested in conducting more research in this field.
    A Fair and Efficient Hybrid Federated Learning Framework based on XGBoost for Distributed Power Prediction. (arXiv:2201.02783v1 [cs.LG])
    (2 min) In a modern power system, real-time data on power generation/consumption and its relevant features are stored in various distributed parties, including household meters, transformer stations and external organizations. To fully exploit the underlying patterns of these distributed data for accurate power prediction, federated learning is needed as a collaborative but privacy-preserving training scheme. However, current federated learning frameworks are polarized towards addressing either the horizontal or vertical separation of data, and tend to overlook the case where both are present. Furthermore, in mainstream horizontal federated learning frameworks, only artificial neural networks are employed to learn the data patterns, which are considered less accurate and interpretable compared to tree-based models on tabular datasets. To this end, we propose a hybrid federated learning framework based on XGBoost, for distributed power prediction from real-time external features. In addition to introducing boosted trees to improve accuracy and interpretability, we combine horizontal and vertical federated learning, to address the scenario where features are scattered in local heterogeneous parties and samples are scattered in various local districts. Moreover, we design a dynamic task allocation scheme such that each party gets a fair share of information, and the computing power of each party can be fully leveraged to boost training efficiency. A follow-up case study is presented to justify the necessity of adopting the proposed framework. The advantages of the proposed framework in fairness, efficiency and accuracy performance are also confirmed.
    Agricultural Plant Cataloging and Establishment of a Data Framework from UAV-based Crop Images by Computer Vision. (arXiv:2201.02885v1 [cs.CV])
    (2 min) UAV-based image retrieval in modern agriculture enables gathering large amounts of spatially referenced crop image data. In large-scale experiments, however, UAV images suffer from containing a multitudinous amount of crops in a complex canopy architecture. Especially for the observation of temporal effects, this complicates the recognition of individual plants over several images and the extraction of relevant information tremendously. In this work, we present a hands-on workflow for the automatized temporal and spatial identification and individualization of crop images from UAVs abbreviated as "cataloging" based on comprehensible computer vision methods. We evaluate the workflow on two real-world datasets. One dataset is recorded for observation of Cercospora leaf spot - a fungal disease - in sugar beet over an entire growing cycle. The other one deals with harvest prediction of cauliflower plants. The plant catalog is utilized for the extraction of single plant images seen over multiple time points. This gathers large-scale spatio-temporal image dataset that in turn can be applied to train further machine learning models including various data layers. The presented approach improves analysis and interpretation of UAV data in agriculture significantly. By validation with some reference data, our method shows an accuracy that is similar to more complex deep learning-based recognition techniques. Our workflow is able to automatize plant cataloging and training image extraction, especially for large datasets.
    A fall alert system with prior-fall activity identification. (arXiv:2201.02803v1 [cs.LG])
    (2 min) Falling, especially in the elderly, is a critical issue to care for and surveil. There have been many studies focusing on fall detection. However, from our survey, there is still no research indicating the prior-fall activities, which we believe that they have a strong correlation with the intensity of the fall. The purpose of this research is to develop a fall alert system that also identifies prior-fall activities. First, we want to find a suitable location to attach a sensor to the body. We created multiple-spot on-body devices to collect various activity data. We used that dataset to train 5 different classification models. We selected the XGBoost classification model for detecting a prior-fall activity and the chest location for use in fall detection from a comparison of the detection accuracy. We then tested 3 existing fall detection threshold algorithms to detect fall and fall to their knees first, and selected the 3-phase threshold algorithm of Chaitep and Chawachat [3] in our system. From the experiment, we found that the fall detection accuracy is 88.91%, the fall to their knees first detection accuracy is 91.25%, and the average accuracy of detection of prior-fall activities is 86.25%. Although we use an activity dataset of young to middle-aged adults (18-49 years), we are confident that this system can be developed to monitor activities before the fall, especially in the elderly, so that caretakers can better manage the situation.
  • stat.ML updates on arXiv.org

    When is Offline Two-Player Zero-Sum Markov Game Solvable?. (arXiv:2201.03522v1 [cs.LG])
    (2 min) We study what dataset assumption permits solving offline two-player zero-sum Markov game. In stark contrast to the offline single-agent Markov decision process, we show that the single strategy concentration assumption is insufficient for learning the Nash equilibrium (NE) strategy in offline two-player zero-sum Markov games. On the other hand, we propose a new assumption named unilateral concentration and design a pessimism-type algorithm that is provably efficient under this assumption. In addition, we show that the unilateral concentration assumption is necessary for learning an NE strategy. Furthermore, our algorithm can achieve minimax sample complexity without any modification for two widely studied settings: dataset with uniform concentration assumption and turn-based Markov game. Our work serves as an important initial step towards understanding offline multi-agent reinforcement learning.
    Understanding Layer-wise Contributions in Deep Neural Networks through Spectral Analysis. (arXiv:2111.03972v3 [cs.LG] UPDATED)
    (2 min) Spectral analysis is a powerful tool, decomposing any function into simpler parts. In machine learning, Mercer's theorem generalizes this idea, providing for any kernel and input distribution a natural basis of functions of increasing frequency. More recently, several works have extended this analysis to deep neural networks through the framework of Neural Tangent Kernel. In this work, we analyze the layer-wise spectral bias of Deep Neural Networks and relate it to the contributions of different layers in the reduction of generalization error for a given target function. We utilize the properties of Hermite polynomials and Spherical Harmonics to prove that initial layers exhibit a larger bias towards high-frequency functions defined on the unit sphere. We further provide empirical results validating our theory in high dimensional datasets for Deep Neural Networks.
    Masked Gradient-Based Causal Structure Learning. (arXiv:1910.08527v3 [cs.LG] UPDATED)
    (2 min) This paper studies the problem of learning causal structures from observational data. We reformulate the Structural Equation Model (SEM) with additive noises in a form parameterized by binary graph adjacency matrix and show that, if the original SEM is identifiable, then the binary adjacency matrix can be identified up to super-graphs of the true causal graph under mild conditions. We then utilize the reformulated SEM to develop a causal structure learning method that can be efficiently trained using gradient-based optimization, by leveraging a smooth characterization on acyclicity and the Gumbel-Softmax approach to approximate the binary adjacency matrix. It is found that the obtained entries are typically near zero or one and can be easily thresholded to identify the edges. We conduct experiments on synthetic and real datasets to validate the effectiveness of the proposed method, and show that it readily includes different smooth model functions and achieves a much improved performance on most datasets considered.
    On the Double Descent of Random Features Models Trained with SGD. (arXiv:2110.06910v4 [stat.ML] UPDATED)
    (2 min) This paper studies generalization properties of random features (RF) regression in high dimensions optimized by stochastic gradient descent (SGD). In this regime, we derive precise non-asymptotic error bounds of RF regression under both constant and adaptive step-size SGD setting, and observe the double descent phenomenon both theoretically and empirically. Our analysis shows how to cope with multiple randomness sources of initialization, label noise, and data sampling (as well as stochastic gradients) with no closed-form solution, and also goes beyond the commonly-used Gaussian/spherical data assumption. Our theoretical results demonstrate that, with SGD training, RF regression still generalizes well for interpolation learning, and is able to characterize the double descent behavior by the unimodality of variance and monotonic decrease of bias. Besides, we also prove that the constant step-size SGD setting incurs no loss in convergence rate when compared to the exact minimal-norm interpolator, as a theoretical justification of using SGD in practice.
    Optimal Experimental Design for Staggered Rollouts. (arXiv:1911.03764v3 [econ.EM] UPDATED)
    (2 min) In this paper, we study the problem of designing experiments that are conducted on a set of units such as users or groups of users in an online marketplace, for multiple time periods such as weeks or months. These experiments are particularly useful to study the treatments that have causal effects on both current and future outcomes (instantaneous and lagged effects). The design problem involves selecting a treatment time for each unit, before or during the experiment, in order to most precisely estimate the instantaneous and lagged effects, post experimentation. This optimization of the treatment decisions can directly minimize the opportunity cost of the experiment by reducing its sample size requirement. The optimization is an NP-hard integer program for which we provide a near-optimal solution, when the design decisions are performed all at the beginning (fixed-sample-size designs). Next, we study sequential experiments that allow adaptive decisions during the experiments, and also potentially early stop the experiments, further reducing their cost. However, the sequential nature of these experiments complicates both the design phase and the estimation phase. We propose a new algorithm, PGAE, that addresses these challenges by adaptively making treatment decisions, estimating the treatment effects, and drawing valid post-experimentation inference. PGAE combines ideas from Bayesian statistics, dynamic programming, and sample splitting. Using synthetic experiments on real data sets from multiple domains, we demonstrate that our proposed solutions for fixed-sample-size and sequential experiments reduce the opportunity cost of the experiments by over 50% and 70%, respectively, compared to benchmarks.
    Comparing Sequential Forecasters. (arXiv:2110.00115v3 [stat.ME] UPDATED)
    (2 min) Consider two or more forecasters, each making a sequence of predictions for different events over time. We ask a relatively basic question: how might we compare these forecasters, either online or post-hoc, while avoiding unverifiable assumptions on how the forecasts or outcomes were generated? This work presents a novel and rigorous answer to this question. We design a sequential inference procedure for estimating the time-varying difference in forecast quality as measured by any scoring rule. The resulting confidence intervals are nonasymptotically valid and can be continuously monitored to yield statistically valid comparisons at arbitrary data-dependent stopping times ("anytime-valid"); this is enabled by adapting variance-adaptive supermartingales, confidence sequences, and e-processes to our setting. Motivated by Shafer and Vovk's game-theoretic probability, our coverage guarantees are also distribution-free, in the sense that they make no distributional assumptions on the forecasts or outcomes. In contrast to a recent work by Henzi and Ziegel, our tools can sequentially test a weak null hypothesis about whether one forecaster outperforms another on average over time. We demonstrate their effectiveness by comparing probability forecasts on Major League Baseball (MLB) games and statistical postprocessing methods for ensemble weather forecasts.
    Bayesian Consistency with the Supremum Metric. (arXiv:2201.03447v1 [math.ST])
    (2 min) We present simple conditions for Bayesian consistency in the supremum metric. The key to the technique is a triangle inequality which allows us to explicitly use weak convergence, a consequence of the standard Kullback--Leibler support condition for the prior. A further condition is to ensure that smoothed versions of densities are not too far from the original density, thus dealing with densities which could track the data too closely. A key result of the paper is that we demonstrate supremum consistency using weaker conditions compared to those currently used to secure $\mathbb{L}_1$ consistency.
    Traversing the Local Polytopes of ReLU Neural Networks: A Unified Approach for Network Verification. (arXiv:2111.08922v2 [cs.LG] UPDATED)
    (2 min) Although neural networks (NNs) with ReLU activation functions have found success in a wide range of applications, their adoption in risk-sensitive settings has been limited by the concerns on robustness and interpretability. Previous works to examine robustness and to improve interpretability partially exploited the piecewise linear function form of ReLU NNs. In this paper, we explore the unique topological structure that ReLU NNs create in the input space, identifying the adjacency among the partitioned local polytopes and developing a traversing algorithm based on this adjacency. Our polytope traversing algorithm can be adapted to verify a wide range of network properties related to robustness and interpretability, providing an unified approach to examine the network behavior. As the traversing algorithm explicitly visits all local polytopes, it returns a clear and full picture of the network behavior within the traversed region. The time and space complexity of the traversing algorithm is determined by the number of a ReLU NN's partitioning hyperplanes passing through the traversing region.
    Cluster Regularization via a Hierarchical Feature Regression. (arXiv:2107.04831v2 [stat.ML] UPDATED)
    (2 min) This paper proposes a novel graph-based regularized regression estimator - the hierarchical feature regression (HFR) -, which mobilizes insights from the domains of machine learning and graph theory to estimate robust parameters for a linear regression. The estimator constructs a supervised feature graph that decomposes parameters along its edges, adjusting first for common variation and successively incorporating idiosyncratic patterns into the fitting process. The graph structure has the effect of shrinking parameters towards group targets, where the extent of shrinkage is governed by a hyperparamter, and group compositions as well as shrinkage targets are determined endogenously. The method offers rich resources for the visual exploration of the latent effect structure in the data, and demonstrates good predictive accuracy and versatility when compared to a panel of commonly used regularization techniques across a range of empirical and simulated regression tasks.
    Stability Based Generalization Bounds for Exponential Family Langevin Dynamics. (arXiv:2201.03064v1 [cs.LG])
    (2 min) We study generalization bounds for noisy stochastic mini-batch iterative algorithms based on the notion of stability. Recent years have seen key advances in data-dependent generalization bounds for noisy iterative learning algorithms such as stochastic gradient Langevin dynamics (SGLD) based on stability (Mou et al., 2018; Li et al., 2020) and information theoretic approaches (Xu and Raginsky, 2017; Negrea et al., 2019; Steinke and Zakynthinou, 2020; Haghifam et al., 2020). In this paper, we unify and substantially generalize stability based generalization bounds and make three technical advances. First, we bound the generalization error of general noisy stochastic iterative algorithms (not necessarily gradient descent) in terms of expected (not uniform) stability. The expected stability can in turn be bounded by a Le Cam Style Divergence. Such bounds have a O(1/n) sample dependence unlike many existing bounds with O(1/\sqrt{n}) dependence. Second, we introduce Exponential Family Langevin Dynamics(EFLD) which is a substantial generalization of SGLD and which allows exponential family noise to be used with stochastic gradient descent (SGD). We establish data-dependent expected stability based generalization bounds for general EFLD algorithms. Third, we consider an important special case of EFLD: noisy sign-SGD, which extends sign-SGD using Bernoulli noise over {-1,+1}. Generalization bounds for noisy sign-SGD are implied by that of EFLD and we also establish optimization guarantees for the algorithm. Further, we present empirical results on benchmark datasets to illustrate that our bounds are non-vacuous and quantitatively much sharper than existing bounds.
    Permuted and Unlinked Monotone Regression in $\mathbb{R}^d$: an approach based on mixture modeling and optimal transport. (arXiv:2201.03528v1 [math.ST])
    (2 min) Suppose that we have a regression problem with response variable Y in $\mathbb{R}^d$ and predictor X in $\mathbb{R}^d$, for $d \geq 1$. In permuted or unlinked regression we have access to separate unordered data on X and Y, as opposed to data on (X,Y)-pairs in usual regression. So far in the literature the case $d=1$ has received attention, see e.g., the recent papers by Rigollet and Weed [Information & Inference, 8, 619--717] and Balabdaoui et al. [J. Mach. Learn. Res., 22(172), 1--60]. In this paper, we consider the general multivariate setting with $d \geq 1$. We show that the notion of cyclical monotonicity of the regression function is sufficient for identification and estimation in the permuted/unlinked regression model. We study permutation recovery in the permuted regression setting and develop a computationally efficient and easy-to-use algorithm for denoising based on the Kiefer-Wolfowitz [Ann. Math. Statist., 27, 887--906] nonparametric maximum likelihood estimator and techniques from the theory of optimal transport. We provide explicit upper bounds on the associated mean squared denoising error for Gaussian noise. As in previous work on the case $d = 1$, the permuted/unlinked setting involves slow (logarithmic) rates of convergence rooting in the underlying deconvolution problem. Numerical studies corroborate our theoretical analysis and show that the proposed approach performs at least on par with the methods in the aforementioned prior work in the case $d = 1$ while achieving substantial reductions in terms of computational complexity.
    Uncovering the Source of Machine Bias. (arXiv:2201.03092v1 [cs.LG])
    (2 min) We develop a structural econometric model to capture the decision dynamics of human evaluators on an online micro-lending platform, and estimate the model parameters using a real-world dataset. We find two types of biases in gender, preference-based bias and belief-based bias, are present in human evaluators' decisions. Both types of biases are in favor of female applicants. Through counterfactual simulations, we quantify the effect of gender bias on loan granting outcomes and the welfare of the company and the borrowers. Our results imply that both the existence of the preference-based bias and that of the belief-based bias reduce the company's profits. When the preference-based bias is removed, the company earns more profits. When the belief-based bias is removed, the company's profits also increase. Both increases result from raising the approval probability for borrowers, especially male borrowers, who eventually pay back loans. For borrowers, the elimination of either bias decreases the gender gap of the true positive rates in the credit risk evaluation. We also train machine learning algorithms on both the real-world data and the data from the counterfactual simulations. We compare the decisions made by those algorithms to see how evaluators' biases are inherited by the algorithms and reflected in machine-based decisions. We find that machine learning algorithms can mitigate both the preference-based bias and the belief-based bias.
    Introduction to Multi-Armed Bandits. (arXiv:1904.07272v7 [cs.LG] UPDATED)
    (2 min) Multi-armed bandits a simple but very powerful framework for algorithms that make decisions over time under uncertainty. An enormous body of work has accumulated over the years, covered in several books and surveys. This book provides a more introductory, textbook-like treatment of the subject. Each chapter tackles a particular line of work, providing a self-contained, teachable technical introduction and a brief review of the further developments; many of the chapters conclude with exercises. The book is structured as follows. The first four chapters are on IID rewards, from the basic model to impossibility results to Bayesian priors to Lipschitz rewards. The next three chapters cover adversarial rewards, from the full-feedback version to adversarial bandits to extensions with linear rewards and combinatorially structured actions. Chapter 8 is on contextual bandits, a middle ground between IID and adversarial bandits in which the change in reward distributions is completely explained by observable contexts. The last three chapters cover connections to economics, from learning in repeated games to bandits with supply/budget constraints to exploration in the presence of incentives. The appendix provides sufficient background on concentration and KL-divergence. The chapters on "bandits with similarity information", "bandits with knapsacks" and "bandits and agents" can also be consumed as standalone surveys on the respective topics.
    Surrogate-assisted performance prediction for data-driven knowledge discovery algorithms: application to evolutionary modeling of clinical pathways. (arXiv:2004.01123v2 [cs.LG] UPDATED)
    (2 min) The paper proposes and investigates an approach for surrogate-assisted performance prediction of data-driven knowledge discovery algorithms. The approach is based on the identification of surrogate models for prediction of the target algorithm's quality and performance. The proposed approach was implemented and investigated as applied to an evolutionary algorithm for discovering clusters of interpretable clinical pathways in electronic health records of patients with acute coronary syndrome. Several clustering metrics and execution time were used as the target quality and performance metrics respectively. An analytical software prototype based on the proposed approach for the prediction of algorithm characteristics and feature analysis was developed to provide a more interpretable prediction of the target algorithm's performance and quality that can be further used for parameter tuning.
    Loss-calibrated expectation propagation for approximate Bayesian decision-making. (arXiv:2201.03128v1 [stat.ML])
    (2 min) Approximate Bayesian inference methods provide a powerful suite of tools for finding approximations to intractable posterior distributions. However, machine learning applications typically involve selecting actions, which -- in a Bayesian setting -- depend on the posterior distribution only via its contribution to expected utility. A growing body of work on loss-calibrated approximate inference methods has therefore sought to develop posterior approximations sensitive to the influence of the utility function. Here we introduce loss-calibrated expectation propagation (Loss-EP), a loss-calibrated variant of expectation propagation. This method resembles standard EP with an additional factor that "tilts" the posterior towards higher-utility decisions. We show applications to Gaussian process classification under binary utility functions with asymmetric penalties on False Negative and False Positive errors, and show how this asymmetry can have dramatic consequences on what information is "useful" to capture in an approximation.
    Individual Privacy Accounting via a Renyi Filter. (arXiv:2008.11193v4 [cs.CR] UPDATED)
    (2 min) We consider a sequential setting in which a single dataset of individuals is used to perform adaptively-chosen analyses, while ensuring that the differential privacy loss of each participant does not exceed a pre-specified privacy budget. The standard approach to this problem relies on bounding a worst-case estimate of the privacy loss over all individuals and all possible values of their data, for every single analysis. Yet, in many scenarios this approach is overly conservative, especially for "typical" data points which incur little privacy loss by participation in most of the analyses. In this work, we give a method for tighter privacy loss accounting based on the value of a personalized privacy loss estimate for each individual in each analysis. To implement the accounting method we design a filter for R\'enyi differential privacy. A filter is a tool that ensures that the privacy parameter of a composed sequence of algorithms with adaptively-chosen privacy parameters does not exceed a pre-specified budget. Our filter is simpler and tighter than the known filter for $(\epsilon,\delta)$-differential privacy by Rogers et al. We apply our results to the analysis of noisy gradient descent and show that personalized accounting can be practical, easy to implement, and can only make the privacy-utility tradeoff tighter.
    A Cross Validation framework for Signal Denoising with Applications to Trend Filtering, Dyadic CART and Beyond. (arXiv:2201.02654v1 [math.ST])
    (2 min) This paper formulates a general cross validation framework for signal denoising. The general framework is then applied to nonparametric regression methods such as Trend Filtering and Dyadic CART. The resulting cross validated versions are then shown to attain nearly the same rates of convergence as are known for the optimally tuned analogues. There did not exist any previous theoretical analyses of cross validated versions of Trend Filtering or Dyadic CART. To illustrate the generality of the framework we also propose and study cross validated versions of two fundamental estimators; lasso for high dimensional linear regression and singular value thresholding for matrix estimation. Our general framework is inspired by the ideas in Chatterjee and Jafarov (2015) and is potentially applicable to a wide range of estimation methods which use tuning parameters.
    Smooth Nested Simulation: Bridging Cubic and Square Root Convergence Rates in High Dimensions. (arXiv:2201.02958v1 [stat.ME])
    (2 min) Nested simulation concerns estimating functionals of a conditional expectation via simulation. In this paper, we propose a new method based on kernel ridge regression to exploit the smoothness of the conditional expectation as a function of the multidimensional conditioning variable. Asymptotic analysis shows that the proposed method can effectively alleviate the curse of dimensionality on the convergence rate as the simulation budget increases, provided that the conditional expectation is sufficiently smooth. The smoothness bridges the gap between the cubic root convergence rate (that is, the optimal rate for the standard nested simulation) and the square root convergence rate (that is, the canonical rate for the standard Monte Carlo simulation). We demonstrate the performance of the proposed method via numerical examples from portfolio risk management and input uncertainty quantification.
    Optimizing the Communication-Accuracy Trade-off in Federated Learning with Rate-Distortion Theory. (arXiv:2201.02664v1 [cs.LG])
    (2 min) A significant bottleneck in federated learning is the network communication cost of sending model updates from client devices to the central server. We propose a method to reduce this cost. Our method encodes quantized updates with an appropriate universal code, taking into account their empirical distribution. Because quantization introduces error, we select quantization levels by optimizing for the desired trade-off in average total bitrate and gradient distortion. We demonstrate empirically that in spite of the non-i.i.d. nature of federated learning, the rate-distortion frontier is consistent across datasets, optimizers, clients and training rounds, and within each setting, distortion reliably predicts model performance. This allows for a remarkably simple compression scheme that is near-optimal in many use cases, and outperforms Top-K, DRIVE, 3LC and QSGD on the Stack Overflow next-word prediction benchmark.
    Attention-based Random Forest and Contamination Model. (arXiv:2201.02880v1 [cs.LG])
    (2 min) A new approach called ABRF (the attention-based random forest) and its modifications for applying the attention mechanism to the random forest (RF) for regression and classification are proposed. The main idea behind the proposed ABRF models is to assign attention weights with trainable parameters to decision trees in a specific way. The weights depend on the distance between an instance, which falls into a corresponding leaf of a tree, and instances, which fall in the same leaf. This idea stems from representation of the Nadaraya-Watson kernel regression in the form of a RF. Three modifications of the general approach are proposed. The first one is based on applying the Huber's contamination model and on computing the attention weights by solving quadratic or linear optimization problems. The second and the third modifications use the gradient-based algorithms for computing trainable parameters. Numerical experiments with various regression and classification datasets illustrate the proposed method.
    Optimal 1-Wasserstein Distance for WGANs. (arXiv:2201.02824v1 [stat.ML])
    (2 min) The mathematical forces at work behind Generative Adversarial Networks raise challenging theoretical issues. Motivated by the important question of characterizing the geometrical properties of the generated distributions, we provide a thorough analysis of Wasserstein GANs (WGANs) in both the finite sample and asymptotic regimes. We study the specific case where the latent space is univariate and derive results valid regardless of the dimension of the output space. We show in particular that for a fixed sample size, the optimal WGANs are closely linked with connected paths minimizing the sum of the squared Euclidean distances between the sample points. We also highlight the fact that WGANs are able to approach (for the 1-Wasserstein distance) the target distribution as the sample size tends to infinity, at a given convergence rate and provided the family of generative Lipschitz functions grows appropriately. We derive in passing new results on optimal transport theory in the semi-discrete setting.
    AIDA: An Active Inference-based Design Agent for Audio Processing Algorithms. (arXiv:2112.13366v2 [eess.AS] UPDATED)
    (2 min) In this paper we present AIDA, which is an active inference-based agent that iteratively designs a personalized audio processing algorithm through situated interactions with a human client. The target application of AIDA is to propose on-the-spot the most interesting alternative values for the tuning parameters of a hearing aid (HA) algorithm, whenever a HA client is not satisfied with their HA performance. AIDA interprets searching for the "most interesting alternative" as an issue of optimal (acoustic) context-aware Bayesian trial design. In computational terms, AIDA is realized as an active inference-based agent with an Expected Free Energy criterion for trial design. This type of architecture is inspired by neuro-economic models on efficient (Bayesian) trial design in brains and implies that AIDA comprises generative probabilistic models for acoustic signals and user responses. We propose a novel generative model for acoustic signals as a sum of time-varying auto-regressive filters and a user response model based on a Gaussian Process Classifier. The full AIDA agent has been implemented in a factor graph for the generative model and all tasks (parameter learning, acoustic context classification, trial design, etc.) are realized by variational message passing on the factor graph. All verification and validation experiments and demonstrations are freely accessible at our GitHub repository.
    Lazy Lagrangians with Predictions for Online Learning. (arXiv:2201.02890v1 [cs.LG])
    (2 min) We consider the general problem of online convex optimization with time-varying additive constraints in the presence of predictions for the next cost and constraint functions. A novel primal-dual algorithm is designed by combining a Follow-The-Regularized-Leader iteration with prediction-adaptive dynamic steps. The algorithm achieves $\mathcal O(T^{\frac{3-\beta}{4}})$ regret and $\mathcal O(T^{\frac{1+\beta}{2}})$ constraint violation bounds that are tunable via parameter $\beta\!\in\![1/2,1)$ and have constant factors that shrink with the predictions quality, achieving eventually $\mathcal O(1)$ regret for perfect predictions. Our work extends the FTRL framework for this constrained OCO setting and outperforms the respective state-of-the-art greedy-based solutions, without imposing conditions on the quality of predictions, the cost functions or the geometry of constraints, beyond convexity.
    Neural-PDE: A RNN based neural network for solving time dependent PDEs. (arXiv:2009.03892v3 [math.NA] UPDATED)
    (2 min) Partial differential equations (PDEs) play a crucial role in studying a vast number of problems in science and engineering. Numerically solving nonlinear and/or high-dimensional PDEs is often a challenging task. Inspired by the traditional finite difference and finite elements methods and emerging advancements in machine learning, we propose a sequence deep learning framework called Neural-PDE, which allows to automatically learn governing rules of any time-dependent PDE system from existing data by using a bidirectional LSTM encoder, and predict the next n time steps data. One critical feature of our proposed framework is that the Neural-PDE is able to simultaneously learn and simulate the multiscale variables.We test the Neural-PDE by a range of examples from one-dimensional PDEs to a high-dimensional and nonlinear complex fluids model. The results show that the Neural-PDE is capable of learning the initial conditions, boundary conditions and differential operators without the knowledge of the specific form of a PDE system.In our experiments the Neural-PDE can efficiently extract the dynamics within 20 epochs training, and produces accurate predictions. Furthermore, unlike the traditional machine learning approaches in learning PDE such as CNN and MLP which require vast parameters for model precision, Neural-PDE shares parameters across all time steps, thus considerably reduces the computational complexity and leads to a fast learning algorithm.
    SCROLLS: Standardized CompaRison Over Long Language Sequences. (arXiv:2201.03533v1 [cs.CL])
    (2 min) NLP benchmarks have largely focused on short texts, such as sentences and paragraphs, even though long texts comprise a considerable amount of natural language in the wild. We introduce SCROLLS, a suite of tasks that require reasoning over long texts. We examine existing long-text datasets, and handpick ones where the text is naturally long, while prioritizing tasks that involve synthesizing information across the input. SCROLLS contains summarization, question answering, and natural language inference tasks, covering multiple domains, including literature, science, business, and entertainment. Initial baselines, including Longformer Encoder-Decoder, indicate that there is ample room for improvement on SCROLLS. We make all datasets available in a unified text-to-text format and host a live leaderboard to facilitate research on model architecture and pretraining methods.
    Accelerated Gradient Methods for Sparse Statistical Learning with Nonconvex Penalties. (arXiv:2009.10629v3 [math.OC] UPDATED)
    (2 min) Nesterov's accelerated gradient (AG) is a popular technique to optimize objective functions comprising two components: a convex loss and a penalty function. While AG methods perform well for convex penalties, such as the LASSO, convergence issues may arise when it is applied to nonconvex penalties, such as SCAD. A recent proposal generalizes Nesterov's AG method to the nonconvex setting but has never been applied to sparse statistical learning problems. There are several hyperparameters to be set before running the proposed algorithm. However, there is no explicit rule as to how the hyperparameters should be selected. In this article, we consider the application of this nonconvex AG algorithm to high-dimensional linear and logistic sparse learning problems, and propose a hyperparameter setting based on the complexity upper bound to accelerate convergence. We further establish the rate of convergence and present a simple and useful bound for the damping sequence. Simulation studies show that convergence can be made, on average, considerably faster than that of the conventional ISTA algorithm. Our experiments also show that the proposed method generally outperforms the current state-of-the-art method in terms of signal recovery.
    Robust classification with flexible discriminant analysis in heterogeneous data. (arXiv:2201.02967v1 [stat.ML])
    (2 min) Linear and Quadratic Discriminant Analysis are well-known classical methods but can heavily suffer from non-Gaussian distributions and/or contaminated datasets, mainly because of the underlying Gaussian assumption that is not robust. To fill this gap, this paper presents a new robust discriminant analysis where each data point is drawn by its own arbitrary Elliptically Symmetrical (ES) distribution and its own arbitrary scale parameter. Such a model allows for possibly very heterogeneous, independent but non-identically distributed samples. After deriving a new decision rule, it is shown that maximum-likelihood parameter estimation and classification are very simple, fast and robust compared to state-of-the-art methods.
    An application of the splitting-up method for the computation of a neural network representation for the solution for the filtering equations. (arXiv:2201.03283v1 [math.PR])
    (2 min) The filtering equations govern the evolution of the conditional distribution of a signal process given partial, and possibly noisy, observations arriving sequentially in time. Their numerical approximation plays a central role in many real-life applications, including numerical weather prediction, finance and engineering. One of the classical approaches to approximate the solution of the filtering equations is to use a PDE inspired method, called the splitting-up method, initiated by Gyongy, Krylov, LeGland, among other contributors. This method, and other PDE based approaches, have particular applicability for solving low-dimensional problems. In this work we combine this method with a neural network representation. The new methodology is used to produce an approximation of the unnormalised conditional distribution of the signal process. We further develop a recursive normalisation procedure to recover the normalised conditional distribution of the signal process. The new scheme can be iterated over multiple time steps whilst keeping its asymptotic unbiasedness property intact. We test the neural network approximations with numerical approximation results for the Kalman and Benes filter.
    Selecting the Best Optimizing System. (arXiv:2201.03065v1 [stat.ME])
    (2 min) We formulate selecting the best optimizing system (SBOS) problems and provide solutions for those problems. In an SBOS problem, a finite number of systems are contenders. Inside each system, a continuous decision variable affects the system's expected performance. An SBOS problem compares different systems based on their expected performances under their own optimally chosen decision to select the best, without advance knowledge of expected performances of the systems nor the optimizing decision inside each system. We design easy-to-implement algorithms that adaptively chooses a system and a choice of decision to evaluate the noisy system performance, sequentially eliminates inferior systems, and eventually recommends a system as the best after spending a user-specified budget. The proposed algorithms integrate the stochastic gradient descent method and the sequential elimination method to simultaneously exploit the structure inside each system and make comparisons across systems. For the proposed algorithms, we prove exponential rates of convergence to zero for the probability of false selection, as the budget grows to infinity. We conduct three numerical examples that represent three practical cases of SBOS problems. Our proposed algorithms demonstrate consistent and stronger performances in terms of the probability of false selection over benchmark algorithms under a range of problem settings and sampling budgets.
    Retiring Adult: New Datasets for Fair Machine Learning. (arXiv:2108.04884v3 [cs.LG] UPDATED)
    (2 min) Although the fairness community has recognized the importance of data, researchers in the area primarily rely on UCI Adult when it comes to tabular data. Derived from a 1994 US Census survey, this dataset has appeared in hundreds of research papers where it served as the basis for the development and comparison of many algorithmic fairness interventions. We reconstruct a superset of the UCI Adult data from available US Census sources and reveal idiosyncrasies of the UCI Adult dataset that limit its external validity. Our primary contribution is a suite of new datasets derived from US Census surveys that extend the existing data ecosystem for research on fair machine learning. We create prediction tasks relating to income, employment, health, transportation, and housing. The data span multiple years and all states of the United States, allowing researchers to study temporal shift and geographic variation. We highlight a broad initial sweep of new empirical insights relating to trade-offs between fairness criteria, performance of algorithmic interventions, and the role of distribution shift based on our new datasets. Our findings inform ongoing debates, challenge some existing narratives, and point to future research directions. Our datasets are available at https://github.com/zykls/folktables.
    Differentially Private $\ell_1$-norm Linear Regression with Heavy-tailed Data. (arXiv:2201.03204v1 [cs.LG])
    (2 min) We study the problem of Differentially Private Stochastic Convex Optimization (DP-SCO) with heavy-tailed data. Specifically, we focus on the $\ell_1$-norm linear regression in the $\epsilon$-DP model. While most of the previous work focuses on the case where the loss function is Lipschitz, here we only need to assume the variates has bounded moments. Firstly, we study the case where the $\ell_2$ norm of data has bounded second order moment. We propose an algorithm which is based on the exponential mechanism and show that it is possible to achieve an upper bound of $\tilde{O}(\sqrt{\frac{d}{n\epsilon}})$ (with high probability). Next, we relax the assumption to bounded $\theta$-th order moment with some $\theta\in (1, 2)$ and show that it is possible to achieve an upper bound of $\tilde{O}(({\frac{d}{n\epsilon}})^\frac{\theta-1}{\theta})$. Our algorithms can also be extended to more relaxed cases where only each coordinate of the data has bounded moments, and we can get an upper bound of $\tilde{O}({\frac{d}{\sqrt{n\epsilon}}})$ and $\tilde{O}({\frac{d}{({n\epsilon})^\frac{\theta-1}{\theta}}})$ in the second and $\theta$-th moment case respectively.
    Learning polytopes with fixed facet directions. (arXiv:2201.03419v1 [math.MG])
    (2 min) We consider the task of reconstructing polytopes with fixed facet directions from finitely many support function evaluations. We show that for fixed simplicial normal fan the least-squares estimate is given by a convex quadratic program. We study the geometry of the solution set and give a combinatorial characterization for the uniqueness of the reconstruction in this case. We provide an algorithm that, under mild assumptions, converges to the unknown input shape as the number of noisy support function evaluations increases. We also discuss limitations of our results if the restriction on the normal fan is removed.
    Effective Sample Size, Dimensionality, and Generalization in Covariate Shift Adaptation. (arXiv:2010.01184v5 [stat.ML] UPDATED)
    (2 min) In supervised learning, training and test datasets are often sampled from distinct distributions. Domain adaptation techniques are thus required. Covariate shift adaptation yields good generalization performance when domains differ only by the marginal distribution of features. Covariate shift adaptation is usually implemented using importance weighting, which may fail, according to common wisdom, due to small effective sample sizes (ESS). Previous research argues this scenario is more common in high-dimensional settings. However, how effective sample size, dimensionality, and model performance/generalization are formally related in supervised learning, considering the context of covariate shift adaptation, is still somewhat obscure in the literature. Thus, a main challenge is presenting a unified theory connecting those points. Hence, in this paper, we focus on building a unified view connecting the ESS, data dimensionality, and generalization in the context of covariate shift adaptation. Moreover, we also demonstrate how dimensionality reduction or feature selection can increase the ESS, and argue that our results support dimensionality reduction before covariate shift adaptation as a good practice.
    The Effects of Reward Misspecification: Mapping and Mitigating Misaligned Models. (arXiv:2201.03544v1 [cs.LG])
    (2 min) Reward hacking -- where RL agents exploit gaps in misspecified reward functions -- has been widely observed, but not yet systematically studied. To understand how reward hacking arises, we construct four RL environments with misspecified rewards. We investigate reward hacking as a function of agent capabilities: model capacity, action space resolution, observation space noise, and training time. More capable agents often exploit reward misspecifications, achieving higher proxy reward and lower true reward than less capable agents. Moreover, we find instances of phase transitions: capability thresholds at which the agent's behavior qualitatively shifts, leading to a sharp decrease in the true reward. Such phase transitions pose challenges to monitoring the safety of ML systems. To address this, we propose an anomaly detection task for aberrant policies and offer several baseline detectors.
    A Survey of Uncertainty in Deep Neural Networks. (arXiv:2107.03342v2 [cs.LG] UPDATED)
    (2 min) Due to their increasing spread, confidence in neural network predictions became more and more important. However, basic neural networks do not deliver certainty estimates or suffer from over or under confidence. Many researchers have been working on understanding and quantifying uncertainty in a neural network's prediction. As a result, different types and sources of uncertainty have been identified and a variety of approaches to measure and quantify uncertainty in neural networks have been proposed. This work gives a comprehensive overview of uncertainty estimation in neural networks, reviews recent advances in the field, highlights current challenges, and identifies potential research opportunities. It is intended to give anyone interested in uncertainty estimation in neural networks a broad overview and introduction, without presupposing prior knowledge in this field. A comprehensive introduction to the most crucial sources of uncertainty is given and their separation into reducible model uncertainty and not reducible data uncertainty is presented. The modeling of these uncertainties based on deterministic neural networks, Bayesian neural networks, ensemble of neural networks, and test-time data augmentation approaches is introduced and different branches of these fields as well as the latest developments are discussed. For a practical application, we discuss different measures of uncertainty, approaches for the calibration of neural networks and give an overview of existing baselines and implementations. Different examples from the wide spectrum of challenges in different fields give an idea of the needs and challenges regarding uncertainties in practical applications. Additionally, the practical limitations of current methods for mission- and safety-critical real world applications are discussed and an outlook on the next steps towards a broader usage of such methods is given.
    Optimal radial basis for density-based atomic representations. (arXiv:2105.08717v2 [stat.ML] UPDATED)
    (2 min) The input of almost every machine learning algorithm targeting the properties of matter at the atomic scale involves a transformation of the list of Cartesian atomic coordinates into a more symmetric representation. Many of the most popular representations can be seen as an expansion of the symmetrized correlations of the atom density, and differ mainly by the choice of basis. Considerable effort has been dedicated to the optimization of the basis set, typically driven by heuristic considerations on the behavior of the regression target. Here we take a different, unsupervised viewpoint, aiming to determine the basis that encodes in the most compact way possible the structural information that is relevant for the dataset at hand. For each training dataset and number of basis functions, one can determine a unique basis that is optimal in this sense, and can be computed at no additional cost with respect to the primitive basis by approximating it with splines. We demonstrate that this construction yields representations that are accurate and computationally efficient, particularly when constructing representations that correspond to high-body order correlations. We present examples that involve both molecular and condensed-phase machine-learning models.
    A Tutorial on Learning With Bayesian Networks. (arXiv:2002.00269v3 [cs.LG] UPDATED)
    (2 min) A Bayesian network is a graphical model that encodes probabilistic relationships among variables of interest. When used in conjunction with statistical techniques, the graphical model has several advantages for data analysis. One, because the model encodes dependencies among all variables, it readily handles situations where some data entries are missing. Two, a Bayesian network can be used to learn causal relationships, and hence can be used to gain understanding about a problem domain and to predict the consequences of intervention. Three, because the model has both a causal and probabilistic semantics, it is an ideal representation for combining prior knowledge (which often comes in causal form) and data. Four, Bayesian statistical methods in conjunction with Bayesian networks offer an efficient and principled approach for avoiding the overfitting of data. In this paper, we discuss methods for constructing Bayesian networks from prior knowledge and summarize Bayesian statistical methods for using data to improve these models. With regard to the latter task, we describe methods for learning both the parameters and structure of a Bayesian network, including techniques for learning with incomplete data. In addition, we relate Bayesian-network methods for learning to techniques for supervised and unsupervised learning. We illustrate the graphical-modeling approach using a real-world case study.
    Provable Clustering of a Union of Linear Manifolds Using Optimal Directions. (arXiv:2201.02745v1 [cs.LG])
    (2 min) This paper focuses on the Matrix Factorization based Clustering (MFC) method which is one of the few closed form algorithms for the subspace clustering problem. Despite being simple, closed-form, and computation-efficient, MFC can outperform the other sophisticated subspace clustering methods in many challenging scenarios. We reveal the connection between MFC and the Innovation Pursuit (iPursuit) algorithm which was shown to be able to outperform the other spectral clustering based methods with a notable margin especially when the span of clusters are close. A novel theoretical study is presented which sheds light on the key performance factors of both algorithms (MFC/iPursuit) and it is shown that both algorithms can be robust to notable intersections between the span of clusters. Importantly, in contrast to the theoretical guarantees of other algorithms which emphasized on the distance between the subspaces as the key performance factor and without making the innovation assumption, it is shown that the performance of MFC/iPursuit mainly depends on the distance between the innovative components of the clusters.
    Trade-offs between membership privacy & adversarially robust learning. (arXiv:2006.04622v2 [cs.LG] UPDATED)
    (2 min) Historically, machine learning methods have not been designed with security in mind. In turn, this has given rise to adversarial examples, carefully perturbed input samples aimed to mislead detection at test time, which have been applied to attack spam and malware classification, and more recently to attack image classification. Consequently, an abundance of research has been devoted to designing machine learning methods that are robust to adversarial examples. Unfortunately, there are desiderata besides robustness that a secure and safe machine learning model must satisfy, such as fairness and privacy. Recent work by Song et al. (2019) has shown, empirically, that there exists a trade-off between robust and private machine learning models. Models designed to be robust to adversarial examples often overfit on training data to a larger extent than standard (non-robust) models. If a dataset contains private information, then any statistical test that separates training and test data by observing a model's outputs can represent a privacy breach, and if a model overfits on training data, these statistical tests become easier. In this work, we identify settings where standard models will overfit to a larger extent in comparison to robust models, and as empirically observed in previous works, settings where the opposite behavior occurs. Thus, it is not necessarily the case that privacy must be sacrificed to achieve robustness. The degree of overfitting naturally depends on the amount of data available for training. We go on to characterize how the training set size factors into the privacy risks exposed by training a robust model on a simple Gaussian data task, and show empirically that our findings hold on image classification benchmark datasets, such as CIFAR-10 and CIFAR-100.
    Wiki-CS: A Wikipedia-Based Benchmark for Graph Neural Networks. (arXiv:2007.02901v2 [cs.LG] UPDATED)
    (2 min) We present Wiki-CS, a novel dataset derived from Wikipedia for benchmarking Graph Neural Networks. The dataset consists of nodes corresponding to Computer Science articles, with edges based on hyperlinks and 10 classes representing different branches of the field. We use the dataset to evaluate semi-supervised node classification and single-relation link prediction models. Our experiments show that these methods perform well on a new domain, with structural properties different from earlier benchmarks. The dataset is publicly available, along with the implementation of the data pipeline and the benchmark experiments, at https://github.com/pmernyei/wiki-cs-dataset .
    L\'evy Induced Stochastic Differential Equation Equipped with Neural Network for Time Series Forecasting. (arXiv:2111.13164v2 [cs.LG] UPDATED)
    (2 min) With the fast development of modern deep learning techniques, the study of dynamic systems and neural networks is increasingly benefiting each other in a lot of different ways. Since uncertainties often arise in real world observations, SDEs (stochastic differential equations) come to play an important role. To be more specific, in this paper, we use a collection of SDEs equipped with neural networks to predict long-term trend of noisy time series which has big jump properties and high probability distribution shift. Our contributions are, first, we explored SDEs driven by $\alpha$-stable L\'evy motion to model the time series data and solved the problem through neural network approximation. Second, we theoretically proved the convergence of the model and obtained the convergence rate. Finally, we illustrated our method by applying it to stock marketing time series prediction and found the convergence order of error.
    Subset Selection with Shrinkage: Sparse Linear Modeling when the SNR is low. (arXiv:1708.03288v4 [stat.ME] UPDATED)
    (2 min) We study a seemingly unexpected and relatively less understood overfitting aspect of a fundamental tool in sparse linear modeling - best subset selection, which minimizes the residual sum of squares subject to a constraint on the number of nonzero coefficients. While the best subset selection procedure is often perceived as the "gold standard" in sparse learning when the signal to noise ratio (SNR) is high, its predictive performance deteriorates when the SNR is low. In particular, it is outperformed by continuous shrinkage methods, such as ridge regression and the Lasso. We investigate the behavior of best subset selection in the high-noise regimes and propose an alternative approach based on a regularized version of the least-squares criterion. Our proposed estimators (a) mitigate, to a large extent, the poor predictive performance of best subset selection in the high-noise regimes; and (b) perform favorably, while generally delivering substantially sparser models, relative to the best predictive models available via ridge regression and the Lasso. We conduct an extensive theoretical analysis of the predictive properties of the proposed approach and provide justification for its superior predictive performance relative to best subset selection when the noise-level is high. Our estimators can be expressed as solutions to mixed integer second order conic optimization problems and, hence, are amenable to modern computational tools from mathematical optimization.
    Enhancing Haptic Distinguishability of Surface Materials with Boosting Technique. (arXiv:2010.02002v4 [cs.LG] UPDATED)
    (2 min) Discriminative features are crucial for several learning applications, such as object detection and classification. Neural networks are extensively used for extracting discriminative features of images and speech signals. However, the lack of large datasets in the haptics domain often limits the applicability of such techniques. This paper presents a general framework for the analysis of the discriminative properties of haptic signals. We demonstrate the effectiveness of spectral features and a boosted embedding technique in enhancing the distinguishability of haptic signals. Experiments indicate our framework needs less training data, generalizes well for different predictors, and outperforms the related state-of-the-art.
    Non-Asymptotic Guarantees for Robust Statistical Learning under $(1+\varepsilon)$-th Moment Assumption. (arXiv:2201.03182v1 [stat.ML])
    (2 min) There has been a surge of interest in developing robust estimators for models with heavy-tailed data in statistics and machine learning. This paper proposes a log-truncated M-estimator for a large family of statistical regressions and establishes its excess risk bound under the condition that the data have $(1+\varepsilon)$-th moment with $\varepsilon \in (0,1]$. With an additional assumption on the associated risk function, we obtain an $\ell_2$-error bound for the estimation. Our theorems are applied to establish robust M-estimators for concrete regressions. Besides convex regressions such as quantile regression and generalized linear models, many non-convex regressions can also be fit into our theorems, we focus on robust deep neural network regressions, which can be solved by the stochastic gradient descent algorithms. Simulations and real data analysis demonstrate the superiority of log-truncated estimations over standard estimations.
  • Data Science Central

    Mastering the Data and Analytics Innovation Index
    (5 min) We are already seeing a handful of organizations who are actively exploiting the unique economic characteristics of data and analytics – digital assets that not only can be used across an unlimited number of use cases at zero marginal cost, but can also continuously-learn, adapt, and refine to become more valuable the more that they… Read More »Mastering the Data and Analytics Innovation Index The post Mastering the Data and Analytics Innovation Index appeared first on Data Science Central.
    How Virtual Reality Is Shaping The Future Of These Four Industries
    (5 min) It’s incredible to think that just 60 years ago, the world was introduced to a new medium of gameplay for the very first time as Pong successfully linked digital technology with a television display. Back then, two sticks hitting a ball back and forth was some seriously ground-breaking stuff. Today, with graphics rivaling the real-world… Read More »How Virtual Reality Is Shaping The Future Of These Four Industries The post How Virtual Reality Is Shaping The Future Of These Four Industries appeared first on Data Science Central.
  • The Official NVIDIA Blog

    World Record-Setting DNA Sequencing Technique Helps Clinicians Rapidly Diagnose Critical Care Patients
    (4 min) Cutting down the time needed to sequence and analyze a patient’s whole genome from days to hours isn’t just about clinical efficiency — it can save lives. By accelerating every step of this process — from collecting a blood sample to sequencing the whole genome to identifying variants linked to diseases — a research team Read article > The post World Record-Setting DNA Sequencing Technique Helps Clinicians Rapidly Diagnose Critical Care Patients appeared first on The Official NVIDIA Blog.
    Elevated Entertainment: SHIELD Experience 9.0 Upgrade Rolling Out Now
    (3 min) SHIELD Software Experience Upgrade 9.0 is rolling out to all NVIDIA SHIELD TVs, delivering the Android 11 operating system and more. An updated Gboard — the Google Keyboard — allows people to use their voices and the Google Assistant to discover content in all search boxes. Additional permissions let users customize privacy across apps, including Read article > The post Elevated Entertainment: SHIELD Experience 9.0 Upgrade Rolling Out Now appeared first on The Official NVIDIA Blog.
  • AWS Machine Learning Blog

    Optimize your inference jobs using dynamic batch inference with TorchServe on Amazon SageMaker
    (8 min) In deep learning, batch processing refers to feeding multiple inputs into a model. Although it’s essential during training, it can be very helpful to manage the cost and optimize throughput during inference time as well. Hardware accelerators are optimized for parallelism, and batching helps saturate the compute capacity and often leads to higher throughput. Batching […]
    Graph-based recommendation system with Neptune ML: An illustration on social network link prediction challenges
    (13 min) Recommendation systems are one of the most widely adopted machine learning (ML) technologies in real-world applications, ranging from social networks to ecommerce platforms. Users of many online systems rely on recommendation systems to make new friendships, discover new music according to suggested music lists, or even make ecommerce purchase decisions based on the recommended products. […]
    Secure access to Amazon SageMaker Studio with AWS SSO and a SAML application
    (9 min) Cloud security at AWS is the highest priority. Amazon SageMaker Studio offers various mechanisms to protect your data and code using integration with AWS security services like AWS Identity and Access Management (IAM), AWS Key Management Service (AWS KMS), or network isolation with Amazon Virtual Private Cloud (Amazon VPC). Customers in highly regulated industries, like […]
  • Becoming Human: Artificial Intelligence Magazine - Medium

    Smart hospitals: the market overview, trends, and considerations
    (10 min) The pandemic has uncovered and amplified the struggles the healthcare sector is facing. From huge doctor workloads to unsatisfied patients…
    Transfer Learning — Part — 5.2!! Implementing ResNet in PyTorch
    (35 min) In Part 5.0 of the Transfer Learning series we have discussed about ResNet pre-trained model in depth so in this series we will implement…
    How to create a recommendation model with help of Machine Learning Techniques and model Ensembling
    (9 min) Sector: Agriculture, Project: “InteliCrop: An Ensemble Model to Predict Crop using Machine Learning Algorithms”
  • John D. Cook

    Queueing theory equations
    (2 min) A blog post about queueing theory that I wrote back in 2008 continues to be popular. The post shows that when a system is close to capacity, adding another server dramatically reduces the wait time. In that example, going from one teller to two tellers doesn’t make service twice as fast but 93 times as […] Queueing theory equations first appeared on John D. Cook.
    Swanson’s rule of thumb
    (2 min) Swanson’s rule of thumb [1] says that the mean of a moderately skewed probability distribution can be approximated by the weighted average of the 10th, 50th, and 90th percentile, with weights 0.3, 0.4, and 0.3 respectively. Because it is based on percentiles, the rule is robust to outliers. Swanson’s rule is used in the oil […] Swanson’s rule of thumb first appeared on John D. Cook.
  • Microsoft Research

    EzPC: Increased data security in the AI model validation process
    (7 min) From manufacturing and logistics to agriculture and transportation, the expansion of artificial intelligence (AI) in the last decade has revolutionized a multitude of industries—examples include enhancing predictive analytics on the manufacturing floor and making microclimate predictions so that farmers can respond and save their crops in time. The adoption of AI is expected to accelerate […] The post EzPC: Increased data security in the AI model validation process appeared first on Microsoft Research.
  • Artificial Intelligence

    Lior Cole Is the Model Combining Artificial Intelligence With Religion
    (1 min) submitted by /u/mr_j_b [link] [comments]
    Artificial intelligence to influence top tech trends in major way in next five years
    (0 min) submitted by /u/mr_j_b [link] [comments]
    How A.I. is set to evolve in 2022
    (0 min) submitted by /u/mr_j_b [link] [comments]
    What is the state of AI? This is the question I try to answer on my blog monthly, hoping to provide valuable information and insights to our community and those outside the field.
    (1 min) submitted by /u/OnlyProggingForFun [link] [comments]
    OneFlow v0.6.0 just came out![P]
    (0 min) submitted by /u/Just0by [link] [comments]
    Ai made Art
    (0 min) submitted by /u/deepnskate [link] [comments]
  • Machine Learning

    [D] Is AI research reaching saturation?
    (3 min) It feels like every topic is heavily researched. There are 100 new papers everyday mentioning small improvements to previous architectures/models with some tweaking. Every task has a near 100% accuracy. Is AI research reaching saturation? What do you guys work on, and how competitive is it in your area? How do you keep up with the several papers cpming in everyday? submitted by /u/Bibbidi_Babbidi_Boo [link] [comments]
    [D] Classical Clustering Benchmarks
    (1 min) Besides for MNIST, what datasets are generally standard to test a method out on? I am looking to test out a non-deep clustering method on various benchmarks. When I look at papers accepted into top ML conferences like NeurIPS, ICML and ICLR, there doesn't appear to be a consistent set of benchmarks. Is there a standard set of clustering benchmarks one is supposed to deploy new methods on? ( I would not expect it this method I'm testing to perform well on e.g. CIFAR-10. My impression is that almost all non-deep methods that aren't custom-tuned to images perform poorly on CIFAR-10) submitted by /u/Grand_Distribution83 [link] [comments]
    [D] [R] large query set few shot learning
    (1 min) Is there a research field specific to image classification/ embedding learning where we have a large number of classes +2k and only few samples per class ~1-5 ? What i noticed in few shot learning papers is the query set is always very small 20-way 1-shot or something similar. Is there any paper you have seen that i can check ? Thank you! submitted by /u/flow_smith [link] [comments]
    [R] ConvNets vs Transformers
    (4 min) A ConvNet for the 2020s - nice read to start 2022. The authors explore modernizations of Resnets and adopt some tricks from transformers training design to make ConvNets great again. There is a lot to reflect and thing about. Code is here. https://preview.redd.it/kqnqe86729b81.png?width=2696&format=png&auto=webp&s=ac0a4f045c61c34756cfcce3073792ace8f64301 submitted by /u/AdelSexy [link] [comments]
    [D] How good are Australian Universities for Deep Learning/AI?
    (1 min) Hey. I was thinking of pursuing my Masters from Australia but am still confused about upto what extent they pursue deep learning. Like I was going through University of Adelaide and found they have their dedicated Computer Vision lab among others which has also produced some good SOTA results in the past. But apart from that, what other universities would the community recommend? Or are there any certain professors which could be of interest. Any suggestions are welcome :) submitted by /u/Unitrix247 [link] [comments]
    [D] GAN training initially degrades results of pre-trained generator
    (1 min) I have an issue with the training of a GAN, which consists of a generator and two discriminators. The generator is used to generate waveforms. 1-The generator is independently pre-trained by regression, up to 400k steps. 2-The two randomly initialized discriminators are then activated, and GANs training takes place. The orange curve, represents the l1-norm between input and output: the l1-norm steeply rises for the first few validation steps. In the blue curve, I tried freezing the generator for 80k steps, to allow pre-training of the discriminators, but the problems persists. However, step 2 inititially degrades the output of the generator, by removing too much information. This is also reflected in all the validation losses (see image above), whose values get worse. The GAN is based or a variation of the code below: https://github.com/rishikksh20/hifigan-denoiser/blob/master/train.py Any suggestions? submitted by /u/alf_Lafleur [link] [comments]
    [P] Tutorial: Time Series Cross-Validation
    (2 min) [Tutorial] - Time Series Cross Validation TL;DR: The code is here. Time Series Forecasting can be overwhelming. Especially if you are just getting started. There are many different types of Time Series tasks each differs by the number of input or output sequences, the number of steps to predict, whether the input and/or the output sequence length is static or changing, and so on. In this notebook, we will experiment with different types of Time Series Cross-Validation Strategies in order to become familiar with them and understand which works best for what case. As written before, Time Series problems can be of different variations, so in order to get a deeper understanding, we should explore each. Variation I: Number of Input / Output sequences (1.0) Single input and single…
    [D] Faster Optimization using GPU Multiprocessing
    (1 min) GPU Multiprocessing for Parameter Optimization TL;DR: The code is here Hi Guys, I published a notebook using a trick for running multiprocessing on GPU devices. Might be interesting for some here, so have fun: As we know, GPU makes everything faster. Moreover, oftentimes the GPU device we use is so powerful that our code doesn't utilize the GPU code to 100% potential. In such cases, running multiple GPU instances in parallel can come in handy and save us some extra time. I've made a notebook that introduces a method for running parallel jobs on GPU devices. It is a bit tricky since most packages that support GPU usage aren't built from the ground up for such use cases. The "trick" is simple: We simply do not call the GPU in the main process, the first time we call for any GPU utilization should be from the child processes. ![](https://i.ibb.co/VDpCKT4/gpu-multiprocessing.png) To demonstrate the power of this concept, I made a notebook that runs Parallel Hyperparameters Optimization on the GPU for LightGBM. As it is assumed, the GPU enables us to train our models faster (much faster) and by leveraging that in combination with the parallel execution we get to search a large space of parameter configurations for fully maximizing our model's performance (Since we can now try many combinations). The Optimization framework is optuna, the popular framework for optimizing the hyperparameters. Again, this is just an example of the approach. You can use the same approach for many other GPU-enabled use cases. Personally, This concept helped me over and over as it is easily transferable to many different GPU-based models including Neural Networks (Tensorflow, Pytorch..), Other GBM models (Catboost, XGBoost..), Feature Engineering (Pandas, CuPy), and more. submitted by /u/yamqwe [link] [comments]
    [R] Julia developers discuss the current state of ML tools in Julia compared to Python
    (1 min) https://discourse.julialang.org/t/state-of-machine-learning-in-julia/74385/18 The developers of some of the largest Julia language packages discuss the current state of ML in Julia, and compare and contrast its status with the Python ML ecosystem. submitted by /u/kdfn [link] [comments]
    [D] Question to ML developers: Do you split your work with a programmer?
    (1 min) Like you do the theory (maths and…) and give it to a programmer to deploy your thoughts. Is it worth your time? submitted by /u/Secure_Pomegranate10 [link] [comments]
    [R] Automated Reinforcement Learning (AutoRL): A Survey and Open Problems
    (1 min) Paper: https://arxiv.org/abs/2201.03916. First survey on the recently developing field of AutoRL. Abstract: The combination of Reinforcement Learning (RL) with deep learning has led to a series of impressive feats, with many believing (deep) RL provides a path towards generally capable agents. However, the success of RL agents is often highly sensitive to design choices in the training process, which may require tedious and error-prone manual tuning. This makes it challenging to use RL for new problems, while also limits its full potential. In many other areas of machine learning, AutoML has shown it is possible to automate such design choices and has also yielded promising initial results when applied to RL. However, Automated Reinforcement Learning (AutoRL) involves not only standard applications of AutoML but also includes additional challenges unique to RL, that naturally produce a different set of methods. As such, AutoRL has been emerging as an important area of research in RL, providing promise in a variety of applications from RNA design to playing games such as Go. Given the diversity of methods and environments considered in RL, much of the research has been conducted in distinct subfields, ranging from meta-learning to evolution. In this survey we seek to unify the field of AutoRL, we provide a common taxonomy, discuss each area in detail and pose open problems which would be of interest to researchers going forward. submitted by /u/machinelearner5000 [link] [comments]
    [D] Edit Videos With CLIP - StyleGAN-V: A Continuous Video Generator with the Price, Image Quality and Perks of StyleGAN2 by Ivan Skorokhodov et al. explained in 5 minutes (by Casual GAN Papers)
    (1 min) While we have seen several new SOTA image generation models pop up over the last year, video generation still remains lackluster, to say the least. But does it have to be? The authors of StyleGAN-V certainly don’t think so! By adapting the generator from StyleGAN2 to work with motion conditions, developing a hypernetwork-based discriminator, and designing a clever acyclic positional encoding, Ivan Skorohodov and the team at KAUST and Snap Inc. deliver a model that generates videos of arbitrary length with arbitrary framerate, is just 5% more expensive to train than a vanilla StyleGAN2, and beats multiple baseline models on 256 and 1024 resolution. Oh, and it only needs to see about 2 frames from a video during training to do so! And if that wasn’t impressive enough, StyleGAN-V is CLIP-compatible for first-ever text-based consistent video editing Full summary: https://t.me/casual_gan/238 Blog post: https://www.casualganpapers.com/text_guided_video_editing_hd_video_generation/StyleGAN-V-explained.html StyleGAN-V: generate hd videos and edit them with CLIP arxiv / code (coming soon) Subscribe to Casual GAN Papers and follow me on Twitter for weekly AI paper summaries! submitted by /u/KirillTheMunchKing [link] [comments]
    [D] Real-time Public data API
    (0 min) I am testing a featured related with data pipeline. To produce a more realistic demo I started searching for a dynamic/real-time public data that I can build a end-to-end project. Best if it is a finance related data. Stock limited order book etc submitted by /u/Late-trip-to-cabo [link] [comments]
  • Reinforcement Learning

    Roman Ring (DeepMind) talks StarCraft, AlphaStar on TalkRL?
    (1 min) submitted by /u/Infamous-Editor5131 [link] [comments]
    Need explanation of some basic code in Spinningup : Spinup utils
    (1 min) Hi. I have a basic query regarding, code in spinup utils. In file test_policy.py, L87. get_action = lambda x : sess.run(action_op, feed_dict={model['x']: x[None,:]})[0] What does the 'x[None,:]' part do/mean? Sorry, I am a beginner and this might be a trivial question. Would appreciate any possible help. submitted by /u/underconfidant_soul [link] [comments]
    Are neural networks necessary to update the actor-critic model parameters ? Is there other method to perform the same .
    (1 min) submitted by /u/aabra__ka__daabra [link] [comments]
    Is Vanilla Policy Gradient algorithm better than Advantage Actor Critic (A2C) ?
    (1 min) Hi! I've been studying Policy Gradient algorithms and implementing them and I've found that with the same amount of data (~5000 samples in total over multiple episodes, 50 "epochs") the simple and basic VPG algorithm with "reward-to-go" function converges to average return of 200 on CartPole-v0 env much, MUCH faster than A2C. Is that how it's supposed to be? Shouldn't the A2C version (with less variance, as the authors claim) be better at convergence? Edit: I've just found a bug in my code, which has improved my results, but still made it comparable to simple VPG. Edit 2: I apologize for even asking the question... It seems that playing around with hyper parameters can change situation drastically. submitted by /u/Decent-Ad9135 [link] [comments]
    Ways for representing environments
    (1 min) Hello there :D, I am currently working on a RL environment of a agent (robot/drone) moving around in the environment to look for a target. Currently I use PPO with Convolutional nets (CNN), where the observations are maps of the area (something like grid world), showing where walls, agents positions, etc are. In a certain timestep the agent collects a stack of maps (3-4) that are fed to the network. I want to move to using LSTM, and I was wondering what other ways can I use to model the observations? Keep in mind my observations should contain agents positions (an agent observes its own location and the locations of other agents) and obstacles in the environment (walls determined by position and dimensions). Is there a better way of modeling these than CNN? I though of using a normal feed forward network, and just feed positions directly and use embeddings for the walls, but I'm not sure how scalable that would be for increasing number of agents (more agents = more input), while in CNN the map dimensions stays the same and just more pixels are set to 1 indicating the agents' locations. Any insights? submitted by /u/AhmedNizam_ [link] [comments]
    [D] Interview - This Team won the Minecraft RL BASALT Challenge! (Paper Explanation & Interview with the authors)
    (1 min) submitted by /u/gwern [link] [comments]

2022-01-11

  • cs.LG updates on arXiv.org

    Autoformer: Decomposition Transformers with Auto-Correlation for Long-Term Series Forecasting. (arXiv:2106.13008v5 [cs.LG] UPDATED)
    (2 min) Extending the forecasting time is a critical demand for real applications, such as extreme weather early warning and long-term energy consumption planning. This paper studies the long-term forecasting problem of time series. Prior Transformer-based models adopt various self-attention mechanisms to discover the long-range dependencies. However, intricate temporal patterns of the long-term future prohibit the model from finding reliable dependencies. Also, Transformers have to adopt the sparse versions of point-wise self-attentions for long series efficiency, resulting in the information utilization bottleneck. Going beyond Transformers, we design Autoformer as a novel decomposition architecture with an Auto-Correlation mechanism. We break with the pre-processing convention of series decomposition and renovate it as a basic inner block of deep models. This design empowers Autoformer with progressive decomposition capacities for complex time series. Further, inspired by the stochastic process theory, we design the Auto-Correlation mechanism based on the series periodicity, which conducts the dependencies discovery and representation aggregation at the sub-series level. Auto-Correlation outperforms self-attention in both efficiency and accuracy. In long-term forecasting, Autoformer yields state-of-the-art accuracy, with a 38% relative improvement on six benchmarks, covering five practical applications: energy, traffic, economics, weather and disease. Code is available at this repository: \url{https://github.com/thuml/Autoformer}.
    Time Series Anomaly Detection for Cyber-Physical Systems via Neural System Identification and Bayesian Filtering. (arXiv:2106.07992v2 [cs.LG] UPDATED)
    (2 min) Recent advances in AIoT technologies have led to an increasing popularity of utilizing machine learning algorithms to detect operational failures for cyber-physical systems (CPS). In its basic form, an anomaly detection module monitors the sensor measurements and actuator states from the physical plant, and detects anomalies in these measurements to identify abnormal operation status. Nevertheless, building effective anomaly detection models for CPS is rather challenging as the model has to accurately detect anomalies in presence of highly complicated system dynamics and unknown amount of sensor noise. In this work, we propose a novel time series anomaly detection method called Neural System Identification and Bayesian Filtering (NSIBF) in which a specially crafted neural network architecture is posed for system identification, i.e., capturing the dynamics of CPS in a dynamical state-space model; then a Bayesian filtering algorithm is naturally applied on top of the "identified" state-space model for robust anomaly detection by tracking the uncertainty of the hidden state of the system recursively over time. We provide qualitative as well as quantitative experiments with the proposed method on a synthetic and three real-world CPS datasets, showing that NSIBF compares favorably to the state-of-the-art methods with considerable improvements on anomaly detection in CPS.
    SpinalNet: Deep Neural Network with Gradual Input. (arXiv:2007.03347v3 [cs.CV] UPDATED)
    (2 min) Deep neural networks (DNNs) have achieved the state of the art performance in numerous fields. However, DNNs need high computation times, and people always expect better performance in a lower computation. Therefore, we study the human somatosensory system and design a neural network (SpinalNet) to achieve higher accuracy with fewer computations. Hidden layers in traditional NNs receive inputs in the previous layer, apply activation function, and then transfer the outcomes to the next layer. In the proposed SpinalNet, each layer is split into three splits: 1) input split, 2) intermediate split, and 3) output split. Input split of each layer receives a part of the inputs. The intermediate split of each layer receives outputs of the intermediate split of the previous layer and outputs of the input split of the current layer. The number of incoming weights becomes significantly lower than traditional DNNs. The SpinalNet can also be used as the fully connected or classification layer of DNN and supports both traditional learning and transfer learning. We observe significant error reductions with lower computational costs in most of the DNNs. Traditional learning on the VGG-5 network with SpinalNet classification layers provided the state-of-the-art (SOTA) performance on QMNIST, Kuzushiji-MNIST, EMNIST (Letters, Digits, and Balanced) datasets. Traditional learning with ImageNet pre-trained initial weights and SpinalNet classification layers provided the SOTA performance on STL-10, Fruits 360, Bird225, and Caltech-101 datasets. The scripts of the proposed SpinalNet are available at the following link: https://github.com/dipuk0506/SpinalNet
    Collaborating with Humans without Human Data. (arXiv:2110.08176v2 [cs.LG] UPDATED)
    (2 min) Collaborating with humans requires rapidly adapting to their individual strengths, weaknesses, and preferences. Unfortunately, most standard multi-agent reinforcement learning techniques, such as self-play (SP) or population play (PP), produce agents that overfit to their training partners and do not generalize well to humans. Alternatively, researchers can collect human data, train a human model using behavioral cloning, and then use that model to train "human-aware" agents ("behavioral cloning play", or BCP). While such an approach can improve the generalization of agents to new human co-players, it involves the onerous and expensive step of collecting large amounts of human data first. Here, we study the problem of how to train agents that collaborate well with human partners without using human data. We argue that the crux of the problem is to produce a diverse set of training partners. Drawing inspiration from successful multi-agent approaches in competitive domains, we find that a surprisingly simple approach is highly effective. We train our agent partner as the best response to a population of self-play agents and their past checkpoints taken throughout training, a method we call Fictitious Co-Play (FCP). Our experiments focus on a two-player collaborative cooking simulator that has recently been proposed as a challenge problem for coordination with humans. We find that FCP agents score significantly higher than SP, PP, and BCP when paired with novel agent and human partners. Furthermore, humans also report a strong subjective preference to partnering with FCP agents over all baselines.
    Machine-learning-based arc selection for constrained shortest path problems in column generation. (arXiv:2201.02535v1 [math.OC])
    (2 min) Column generation is an iterative method used to solve a variety of optimization problems. It decomposes the problem into two parts: a master problem, and one or more pricing problems (PP). The total computing time taken by the method is divided between these two parts. In routing or scheduling applications, the problems are mostly defined on a network, and the PP is usually an NP-hard shortest path problem with resource constraints. In this work, we propose a new heuristic pricing algorithm based on machine learning. By taking advantage of the data collected during previous executions, the objective is to reduce the size of the network and accelerate the PP, keeping only the arcs that have a high chance to be part of the linear relaxation solution. The method has been applied to two specific problems: the vehicle and crew scheduling problem in public transit and the vehicle routing problem with time windows. Reductions in computational time of up to 40% can be obtained.
    Community recovery in non-binary and temporal stochastic block models. (arXiv:2008.04790v3 [math.ST] UPDATED)
    (2 min) This article studies the estimation of community memberships from non-binary pair interactions represented by an $N$-by-$N$ tensor whose values are elements of $\mathcal S$, where $N$ is the number of nodes and $\mathcal S$ is the space of the pairwise interactions between the nodes. As an information-theoretic benchmark, we study data sets generated by a non-binary stochastic block model, and derive fundamental information criteria for the recovery of the community memberships as $N \to \infty$. Examples of applications include weighted networks ($\mathcal S = \mathbb R$), link-labeled networks $(\mathcal S = \{0, 1, \dots, L\}$), multiplex networks $(\mathcal S = \{0,1\}^M$) and temporal networks ($\mathcal S = \{0,1\}^T$). For temporal interactions, we show that (i) even a small increase in $T$ may have a big impact on the recovery of community memberships, (ii) consistent recovery is possible even for very sparse data (e.g.\ bounded average degree) when $T$ is large enough. We also present several estimation algorithms, both offline and online, which fully utilise the temporal nature of the observed data. We analyse the accuracy of the proposed estimation algorithms under various assumptions on data sparsity and identifiability. Numerical experiments show that even a poor initial estimate (e.g., blind random guess) of the community assignment leads to high accuracy obtained by the online algorithm after a small number of iterations, and remarkably so also in very sparse regimes.
    Acoustic Neighbor Embeddings. (arXiv:2007.10329v5 [eess.AS] UPDATED)
    (2 min) This paper proposes a novel acoustic word embedding called Acoustic Neighbor Embeddings where speech or text of arbitrary length are mapped to a vector space of fixed, reduced dimensions by adapting stochastic neighbor embedding (SNE) to sequential inputs. The Euclidean distance between coordinates in the embedding space reflects the phonetic confusability between their corresponding sequences. Two encoder neural networks are trained: an acoustic encoder that accepts speech signals in the form of frame-wise subword posterior probabilities obtained from an acoustic model and a text encoder that accepts text in the form of subword transcriptions. Compared to a triplet loss criterion, the proposed method is shown to have more effective gradients for neural network training. Experimentally, it also gives more accurate results with low-dimensional embeddings when the two encoder networks are used in tandem in a word (name) recognition task, and when the text encoder network is used standalone in an approximate phonetic matching task. In particular, in an isolated name recognition task depending solely on Euclidean nearest-neighbor search between the proposed embedding vectors, the recognition accuracy is identical to that of conventional finite state transducer(FST)-based decoding using test data with up to 1 million names in the vocabulary and 40 dimensions in the embeddings.
    Deep Reinforcement Learning. (arXiv:2201.02135v1 [cs.AI] CROSS LISTED)
    (2 min) Deep reinforcement learning has gathered much attention recently. Impressive results were achieved in activities as diverse as autonomous driving, game playing, molecular recombination, and robotics. In all these fields, computer programs have taught themselves to solve difficult problems. They have learned to fly model helicopters and perform aerobatic manoeuvers such as loops and rolls. In some applications they have even become better than the best humans, such as in Atari, Go, poker and StarCraft. The way in which deep reinforcement learning explores complex environments reminds us of how children learn, by playfully trying out things, getting feedback, and trying again. The computer seems to truly possess aspects of human learning; this goes to the heart of the dream of artificial intelligence. The successes in research have not gone unnoticed by educators, and universities have started to offer courses on the subject. The aim of this book is to provide a comprehensive overview of the field of deep reinforcement learning. The book is written for graduate students of artificial intelligence, and for researchers and practitioners who wish to better understand deep reinforcement learning methods and their challenges. We assume an undergraduate-level of understanding of computer science and artificial intelligence; the programming language of this book is Python. We describe the foundations, the algorithms and the applications of deep reinforcement learning. We cover the established model-free and model-based methods that form the basis of the field. Developments go quickly, and we also cover advanced topics: deep multi-agent reinforcement learning, deep hierarchical reinforcement learning, and deep meta learning.
    Weighted Graph-Based Signal Temporal Logic Inference Using Neural Networks. (arXiv:2109.08078v2 [cs.AI] UPDATED)
    (2 min) Extracting spatial-temporal knowledge from data is useful in many applications. It is important that the obtained knowledge is human-interpretable and amenable to formal analysis. In this paper, we propose a method that trains neural networks to learn spatial-temporal properties in the form of weighted graph-based signal temporal logic (wGSTL) formulas. For learning wGSTL formulas, we introduce a flexible wGSTL formula structure in which the user's preference can be applied in the inferred wGSTL formulas. In the proposed framework, each neuron of the neural networks corresponds to a subformula in a flexible wGSTL formula structure. We initially train a neural network to learn the wGSTL operators and then train a second neural network to learn the parameters in a flexible wGSTL formula structure. We use a COVID-19 dataset and a rain prediction dataset to evaluate the performance of the proposed framework and algorithms. We compare the performance of the proposed framework with three baseline classification methods including K-nearest neighbors, decision trees, support vector machine, and artificial neural networks. The classification accuracy obtained by the proposed framework is comparable with the baseline classification methods.
    AugmentedPCA: A Python Package of Supervised and Adversarial Linear Factor Models. (arXiv:2201.02547v1 [stat.ML])
    (2 min) Deep autoencoders are often extended with a supervised or adversarial loss to learn latent representations with desirable properties, such as greater predictivity of labels and outcomes or fairness with respects to a sensitive variable. Despite the ubiquity of supervised and adversarial deep latent factor models, these methods should demonstrate improvement over simpler linear approaches to be preferred in practice. This necessitates a reproducible linear analog that still adheres to an augmenting supervised or adversarial objective. We address this methodological gap by presenting methods that augment the principal component analysis (PCA) objective with either a supervised or an adversarial objective and provide analytic and reproducible solutions. We implement these methods in an open-source Python package, AugmentedPCA, that can produce excellent real-world baselines. We demonstrate the utility of these factor models on an open-source, RNA-seq cancer gene expression dataset, showing that augmenting with a supervised objective results in improved downstream classification performance, produces principal components with greater class fidelity, and facilitates identification of genes aligned with the principal axes of data variance with implications to development of specific types of cancer.
    Quantum Unsupervised and Supervised Learning on Superconducting Processors. (arXiv:1909.04226v2 [quant-ph] UPDATED)
    (2 min) Machine learning algorithms perform well on identifying patterns in many different datasets due to their versatility. However, as one increases the size of the dataset, the computation time for training and using these statistical models grows quickly. Quantum computing offers a new paradigm which may have the ability to overcome these computational difficulties. Here, we propose a quantum analogue to K-means clustering, implement it on simulated superconducting qubits, and compare it to a previously developed quantum support vector machine. We find the algorithm's accuracy comparable to the classical K-means algorithm for clustering and classification problems, and find that it has asymptotic complexity $O(N^{3/2}K^{1/2}\log{P})$, where $N$ is the number of data points, $K$ is the number of clusters, and $P$ is the dimension of the data points, giving a significant speedup over the classical analogue.
    E-GraphSAGE: A Graph Neural Network based Intrusion Detection System for IoT. (arXiv:2103.16329v7 [cs.NI] UPDATED)
    (2 min) This paper presents a new Network Intrusion Detection System (NIDS) based on Graph Neural Networks (GNNs). GNNs are a relatively new sub-field of deep neural networks, which can leverage the inherent structure of graph-based data. Training and evaluation data for NIDSs are typically represented as flow records, which can naturally be represented in a graph format. In this paper, we propose E-GraphSAGE, a GNN approach that allows capturing both the edge features of a graph as well as the topological information for network intrusion detection in IoT networks. To the best of our knowledge, our proposal is the first successful, practical, and extensively evaluated approach of applying GNNs on the problem of network intrusion detection for IoT using flow-based data. Our extensive experimental evaluation on four recent NIDS benchmark datasets shows that our approach outperforms the state-of-the-art in terms of key classification metrics, which demonstrates the potential of GNNs in network intrusion detection, and provides motivation for further research.
    Exploratory Lagrangian-Based Particle Tracing Using Deep Learning. (arXiv:2110.08338v2 [cs.LG] UPDATED)
    (2 min) Time-varying vector fields produced by computational fluid dynamics simulations are often prohibitively large and pose challenges for accurate interactive analysis and exploration. To address these challenges, reduced Lagrangian representations have been increasingly researched as a means to improve scientific time-varying vector field exploration capabilities. This paper presents a novel deep neural network-based particle tracing method to explore time-varying vector fields represented by Lagrangian flow maps. In our workflow, in situ processing is first utilized to extract Lagrangian flow maps, and deep neural networks then use the extracted data to learn flow field behavior. Using a trained model to predict new particle trajectories offers a fixed small memory footprint and fast inference. To demonstrate and evaluate the proposed method, we perform an in-depth study of performance using a well-known analytical data set, the Double Gyre. Our study considers two flow map extraction strategies as well as the impact of the number of training samples and integration durations on efficacy, evaluates multiple sampling options for training and testing and informs hyperparameter settings. Overall, we find our method requires a fixed memory footprint of 10.5 MB to encode a Lagrangian representation of a time-varying vector field while maintaining accuracy. For post hoc analysis, loading the trained model costs only two seconds, significantly reducing the burden of I/O when reading data for visualization. Moreover, our parallel implementation can infer one hundred locations for each of two thousand new pathlines across the entire temporal resolution in 1.3 seconds using one NVIDIA Titan RTX GPU.
    Optimality in Noisy Importance Sampling. (arXiv:2201.02432v1 [stat.ML])
    (2 min) In this work, we analyze the noisy importance sampling (IS), i.e., IS working with noisy evaluations of the target density. We present the general framework and derive optimal proposal densities for noisy IS estimators. The optimal proposals incorporate the information of the variance of the noisy realizations, proposing points in regions where the noise power is higher. We also compare the use of the optimal proposals with previous optimality approaches considered in a noisy IS framework.
    Connecting Weighted Automata, Tensor Networks and Recurrent Neural Networks through Spectral Learning. (arXiv:2010.10029v2 [cs.LG] UPDATED)
    (2 min) In this paper, we present connections between three models used in different research fields: weighted finite automata~(WFA) from formal languages and linguistics, recurrent neural networks used in machine learning, and tensor networks which encompasses a set of optimization techniques for high-order tensors used in quantum physics and numerical analysis. We first present an intrinsic relation between WFA and the tensor train decomposition, a particular form of tensor network. This relation allows us to exhibit a novel low rank structure of the Hankel matrix of a function computed by a WFA and to design an efficient spectral learning algorithm leveraging this structure to scale the algorithm up to very large Hankel matrices.We then unravel a fundamental connection between WFA and second-orderrecurrent neural networks~(2-RNN): in the case of sequences of discrete symbols, WFA and 2-RNN with linear activationfunctions are expressively equivalent. Leveraging this equivalence result combined with the classical spectral learning algorithm for weighted automata, we introduce the first provable learning algorithm for linear 2-RNN defined over sequences of continuous input vectors.This algorithm relies on estimating low rank sub-blocks of the Hankel tensor, from which the parameters of a linear 2-RNN can be provably recovered. The performances of the proposed learning algorithm are assessed in a simulation study on both synthetic and real-world data.
    Path classification by stochastic linear recurrent neural networks. (arXiv:2108.03090v2 [stat.ML] UPDATED)
    (2 min) We investigate the functioning of a classifying biological neural network from the perspective of statistical learning theory, modelled, in a simplified setting, as a continuous-time stochastic recurrent neural network (RNN) with identity activation function. In the purely stochastic (robust) regime, we give a generalisation error bound that holds with high probability, thus showing that the empirical risk minimiser is the best-in-class hypothesis. We show that RNNs retain a partial signature of the paths they are fed as the unique information exploited for training and classification tasks. We argue that these RNNs are easy to train and robust and back these observations with numerical experiments on both synthetic and real data. We also exhibit a trade-off phenomenon between accuracy and robustness.
    A framework for deep learning emulation of numerical models with a case study in satellite remote sensing. (arXiv:1910.13408v2 [cs.LG] UPDATED)
    (3 min) Numerical models based on physics represent the state-of-the-art in earth system modeling and comprise our best tools for generating insights and predictions. Despite rapid growth in computational power, the perceived need for higher model resolutions overwhelms the latest-generation computers, reducing the ability of modelers to generate simulations for understanding parameter sensitivities and characterizing variability and uncertainty. Thus, surrogate models are often developed to capture the essential attributes of the full-blown numerical models. Recent successes of machine learning methods, especially deep learning, across many disciplines offer the possibility that complex nonlinear connectionist representations may be able to capture the underlying complex structures and nonlinear processes in earth systems. A difficult test for deep learning-based emulation, which refers to function approximation of numerical models, is to understand whether they can be comparable to traditional forms of surrogate models in terms of computational efficiency while simultaneously reproducing model results in a credible manner. A deep learning emulation that passes this test may be expected to perform even better than simple models with respect to capturing complex processes and spatiotemporal dependencies. Here we examine, with a case study in satellite-based remote sensing, the hypothesis that deep learning approaches can credibly represent the simulations from a surrogate model with comparable computational efficiency. Our results are encouraging in that the deep learning emulation reproduces the results with acceptable accuracy and often even faster performance. We discuss the broader implications of our results in light of the pace of improvements in high-performance implementations of deep learning as well as the growing desire for higher-resolution simulations in the earth sciences.
    Quantitative Performance Assessment of CNN Units via Topological Entropy Calculation. (arXiv:2103.09716v2 [cs.CV] UPDATED)
    (2 min) Identifying the status of individual network units is critical for understanding the mechanism of convolutional neural networks (CNNs). However, it is still challenging to reliably give a general indication of unit status, especially for units in different network models. To this end, we propose a novel method for quantitatively clarifying the status of single unit in CNN using algebraic topological tools. Unit status is indicated via the calculation of a defined topological-based entropy, called feature entropy, which measures the degree of chaos of the global spatial pattern hidden in the unit for a category. In this way, feature entropy could provide an accurate indication of status for units in different networks with diverse situations like weight-rescaling operation. Further, we show that feature entropy decreases as the layer goes deeper and shares almost simultaneous trend with loss during training. We show that by investigating the feature entropy of units on only training data, it could give discrimination between networks with different generalization ability from the view of the effectiveness of feature representations.
    QuantumNAS: Noise-Adaptive Search for Robust Quantum Circuits. (arXiv:2107.10845v5 [quant-ph] UPDATED)
    (3 min) Quantum noise is the key challenge in Noisy Intermediate-Scale Quantum (NISQ) computers. Previous work for mitigating noise has primarily focused on gate-level or pulse-level noise-adaptive compilation. However, limited research efforts have explored a higher level of optimization by making the quantum circuits themselves resilient to noise. We propose QuantumNAS, a comprehensive framework for noise-adaptive co-search of the variational circuit and qubit mapping. Variational quantum circuits are a promising approach for constructing QML and quantum simulation. However, finding the best variational circuit and its optimal parameters is challenging due to the large design space and parameter training cost. We propose to decouple the circuit search and parameter training by introducing a novel SuperCircuit. The SuperCircuit is constructed with multiple layers of pre-defined parameterized gates and trained by iteratively sampling and updating the parameter subsets (SubCircuits) of it. It provides an accurate estimation of SubCircuits performance trained from scratch. Then we perform an evolutionary co-search of SubCircuit and its qubit mapping. The SubCircuit performance is estimated with parameters inherited from SuperCircuit and simulated with real device noise models. Finally, we perform iterative gate pruning and finetuning to remove redundant gates. Extensively evaluated with 12 QML and VQE benchmarks on 14 quantum computers, QuantumNAS significantly outperforms baselines. For QML, QuantumNAS is the first to demonstrate over 95% 2-class, 85% 4-class, and 32% 10-class classification accuracy on real QC. It also achieves the lowest eigenvalue for VQE tasks on H2, H2O, LiH, CH4, BeH2 compared with UCCSD. We also open-source TorchQuantum (https://github.com/mit-han-lab/torchquantum) for fast training of parameterized quantum circuits to facilitate future research.
    Comprehensive RF Dataset Collection and Release: A Deep Learning-Based Device Fingerprinting Use Case. (arXiv:2201.02213v1 [eess.SP])
    (2 min) Deep learning-based RF fingerprinting has recently been recognized as a potential solution for enabling newly emerging wireless network applications, such as spectrum access policy enforcement, automated network device authentication, and unauthorized network access monitoring and control. Real, comprehensive RF datasets are now needed more than ever to enable the study, assessment, and validation of newly developed RF fingerprinting approaches. In this paper, we present and release a large-scale RF fingerprinting dataset, collected from 25 different LoRa-enabled IoT transmitting devices using USRP B210 receivers. Our dataset consists of a large number of SigMF-compliant binary files representing the I/Q time-domain samples and their corresponding FFT-based files of LoRa transmissions. This dataset provides a comprehensive set of essential experimental scenarios, considering both indoor and outdoor environments and various network deployments and configurations, such as the distance between the transmitters and the receiver, the configuration of the considered LoRa modulation, the physical location of the conducted experiment, and the receiver hardware used for training and testing the neural network models.
    Multi-Model Federated Learning. (arXiv:2201.02582v1 [cs.LG])
    (2 min) Federated learning is a form of distributed learning with the key challenge being the non-identically distributed nature of the data in the participating clients. In this paper, we extend federated learning to the setting where multiple unrelated models are trained simultaneously. Specifically, every client is able to train any one of M models at a time and the server maintains a model for each of the M models which is typically a suitably averaged version of the model computed by the clients. We propose multiple policies for assigning learning tasks to clients over time. In the first policy, we extend the widely studied FedAvg to multi-model learning by allotting models to clients in an i.i.d. stochastic manner. In addition, we propose two new policies for client selection in a multi-model federated setting which make decisions based on current local losses for each client-model pair. We compare the performance of the policies on tasks involving synthetic and real-world data and characterize the performance of the proposed policies. The key take-away from our work is that the proposed multi-model policies perform better or at least as good as single model training using FedAvg.
    Neural calibration of hidden inhomogeneous Markov chains -- Information decompression in life insurance. (arXiv:2201.02397v1 [cs.LG])
    (0 min) Markov chains play a key role in a vast number of areas, including life insurance mathematics. Standard actuarial quantities as the premium value can be interpreted as compressed, lossy information about the underlying Markov process. We introduce a method to reconstruct the underlying Markov chain given collective information of a portfolio of contracts. Our neural architecture explainably characterizes the process by explicitly providing one-step transition probabilities. Further, we provide an intrinsic, economic model validation to inspect the quality of the information decompression. Lastly, our methodology is successfully tested for a realistic data set of German term life insurance contracts.
    Negative Evidence Matters in Interpretable Histology Image Classification. (arXiv:2201.02445v1 [eess.IV])
    (0 min) Using only global annotations such as the image class labels, weakly-supervised learning methods allow CNN classifiers to jointly classify an image, and yield the regions of interest associated with the predicted class. However, without any guidance at the pixel level, such methods may yield inaccurate regions. This problem is known to be more challenging with histology images than with natural ones, since objects are less salient, structures have more variations, and foreground and background regions have stronger similarities. Therefore, methods in computer vision literature for visual interpretation of CNNs may not directly apply. In this work, we propose a simple yet efficient method based on a composite loss function that leverages information from the fully negative samples. Our new loss function contains two complementary terms: the first exploits positive evidence collected from the CNN classifier, while the second leverages the fully negative samples from the training dataset. In particular, we equip a pre-trained classifier with a decoder that allows refining the regions of interest. The same classifier is exploited to collect both the positive and negative evidence at the pixel level to train the decoder. This enables to take advantages of the fully negative samples that occurs naturally in the data, without any additional supervision signals and using only the image class as supervision. Compared to several recent related methods, over the public benchmark GlaS for colon cancer and a Camelyon16 patch-based benchmark for breast cancer using three different backbones, we show the substantial improvements introduced by our method. Our results shows the benefits of using both negative and positive evidence, ie, the one obtained from a classifier and the one naturally available in datasets. We provide an ablation study of both terms. Our code is publicly available.
    MGAE: Masked Autoencoders for Self-Supervised Learning on Graphs. (arXiv:2201.02534v1 [cs.LG])
    (2 min) We introduce a novel masked graph autoencoder (MGAE) framework to perform effective learning on graph structure data. Taking insights from self-supervised learning, we randomly mask a large proportion of edges and try to reconstruct these missing edges during training. MGAE has two core designs. First, we find that masking a high ratio of the input graph structure, e.g., $70\%$, yields a nontrivial and meaningful self-supervisory task that benefits downstream applications. Second, we employ a graph neural network (GNN) as an encoder to perform message propagation on the partially-masked graph. To reconstruct the large number of masked edges, a tailored cross-correlation decoder is proposed. It could capture the cross-correlation between the head and tail nodes of anchor edge in multi-granularity. Coupling these two designs enables MGAE to be trained efficiently and effectively. Extensive experiments on multiple open datasets (Planetoid and OGB benchmarks) demonstrate that MGAE generally performs better than state-of-the-art unsupervised learning competitors on link prediction and node classification.
    Towards Lower Bounds on the Depth of ReLU Neural Networks. (arXiv:2105.14835v3 [cs.LG] UPDATED)
    (2 min) We contribute to a better understanding of the class of functions that is represented by a neural network with ReLU activations and a given architecture. Using techniques from mixed-integer optimization, polyhedral theory, and tropical geometry, we provide a mathematical counterbalance to the universal approximation theorems which suggest that a single hidden layer is sufficient for learning tasks. In particular, we investigate whether the class of exactly representable functions strictly increases by adding more layers (with no restrictions on size). This problem has potential impact on algorithmic and statistical aspects because of the insight it provides into the class of functions represented by neural hypothesis classes. However, to the best of our knowledge, this question has not been investigated in the neural network literature. We also present upper bounds on the sizes of neural networks required to represent functions in these neural hypothesis classes.
    Budget-aware Few-shot Learning via Graph Convolutional Network. (arXiv:2201.02304v1 [cs.CV])
    (2 min) This paper tackles the problem of few-shot learning, which aims to learn new visual concepts from a few examples. A common problem setting in few-shot classification assumes random sampling strategy in acquiring data labels, which is inefficient in practical applications. In this work, we introduce a new budget-aware few-shot learning problem that not only aims to learn novel object categories, but also needs to select informative examples to annotate in order to achieve data efficiency. We develop a meta-learning strategy for our budget-aware few-shot learning task, which jointly learns a novel data selection policy based on a Graph Convolutional Network (GCN) and an example-based few-shot classifier. Our selection policy computes a context-sensitive representation for each unlabeled data by graph message passing, which is then used to predict an informativeness score for sequential selection. We validate our method by extensive experiments on the mini-ImageNet, tiered-ImageNet and Omniglot datasets. The results show our few-shot learning strategy outperforms baselines by a sizable margin, which demonstrates the efficacy of our method.
    Lattice-Based Methods Surpass Sum-of-Squares in Clustering. (arXiv:2112.03898v2 [cs.LG] UPDATED)
    (2 min) Clustering is a fundamental primitive in unsupervised learning which gives rise to a rich class of computationally-challenging inference tasks. In this work, we focus on the canonical task of clustering d-dimensional Gaussian mixtures with unknown (and possibly degenerate) covariance. Recent works (Ghosh et al. '20; Mao, Wein '21; Davis, Diaz, Wang '21) have established lower bounds against the class of low-degree polynomial methods and the sum-of-squares (SoS) hierarchy for recovering certain hidden structures planted in Gaussian clustering instances. Prior work on many similar inference tasks portends that such lower bounds strongly suggest the presence of an inherent statistical-to-computational gap for clustering, that is, a parameter regime where the clustering task is statistically possible but no polynomial-time algorithm succeeds. One special case of the clustering task we consider is equivalent to the problem of finding a planted hypercube vector in an otherwise random subspace. We show that, perhaps surprisingly, this particular clustering model does not exhibit a statistical-to-computational gap, even though the aforementioned low-degree and SoS lower bounds continue to apply in this case. To achieve this, we give a polynomial-time algorithm based on the Lenstra--Lenstra--Lovasz lattice basis reduction method which achieves the statistically-optimal sample complexity of d+1 samples. This result extends the class of problems whose conjectured statistical-to-computational gaps can be "closed" by "brittle" polynomial-time algorithms, highlighting the crucial but subtle role of noise in the onset of statistical-to-computational gaps.
    Binary matrix factorization on special purpose hardware. (arXiv:2010.08693v2 [cs.LG] UPDATED)
    (2 min) Many fundamental problems in data mining can be reduced to one or more NP-hard combinatorial optimization problems. Recent advances in novel technologies such as quantum and quantum-inspired hardware promise a substantial speedup for solving these problems compared to when using general purpose computers but often require the problem to be modeled in a special form, such as an Ising or quadratic unconstrained binary optimization (QUBO) model, in order to take advantage of these devices. In this work, we focus on the important binary matrix factorization (BMF) problem which has many applications in data mining. We propose two QUBO formulations for BMF. We show how clustering constraints can easily be incorporated into these formulations. The special purpose hardware we consider is limited in the number of variables it can handle which presents a challenge when factorizing large matrices. We propose a sampling based approach to overcome this challenge, allowing us to factorize large rectangular matrices. In addition to these methods, we also propose a simple baseline algorithm which outperforms our more sophisticated methods in a few situations. We run experiments on the Fujitsu Digital Annealer, a quantum-inspired complementary metal-oxide-semiconductor (CMOS) annealer, on both synthetic and real data, including gene expression data. These experiments show that our approach is able to produce more accurate BMFs than competing methods.
    Generalized quantum similarity learning. (arXiv:2201.02310v1 [quant-ph])
    (2 min) The similarity between objects is significant in a broad range of areas. While similarity can be measured using off-the-shelf distance functions, they may fail to capture the inherent meaning of similarity, which tends to depend on the underlying data and task. Moreover, conventional distance functions limit the space of similarity measures to be symmetric and do not directly allow comparing objects from different spaces. We propose using quantum networks (GQSim) for learning task-dependent (a)symmetric similarity between data that need not have the same dimensionality. We analyze the properties of such similarity function analytically (for a simple case) and numerically (for a complex case) and showthat these similarity measures can extract salient features of the data. We also demonstrate that the similarity measure derived using this technique is $(\epsilon,\gamma,\tau)$-good, resulting in theoretically guaranteed performance. Finally, we conclude by applying this technique for three relevant applications - Classification, Graph Completion, Generative modeling.
    Applications of Signature Methods to Market Anomaly Detection. (arXiv:2201.02441v1 [q-fin.CP])
    (2 min) Anomaly detection is the process of identifying abnormal instances or events in data sets which deviate from the norm significantly. In this study, we propose a signatures based machine learning algorithm to detect rare or unexpected items in a given data set of time series type. We present applications of signature or randomized signature as feature extractors for anomaly detection algorithms; additionally we provide an easy, representation theoretic justification for the construction of randomized signatures. Our first application is based on synthetic data and aims at distinguishing between real and fake trajectories of stock prices, which are indistinguishable by visual inspection. We also show a real life application by using transaction data from the cryptocurrency market. In this case, we are able to identify pump and dump attempts organized on social networks with F1 scores up to 88% by means of our unsupervised learning algorithm, thus achieving results that are close to the state-of-the-art in the field based on supervised learning.
    Detecting Renewal States in Chains of Variable Length via Intrinsic Bayes Factors. (arXiv:2110.07430v2 [cs.LG] UPDATED)
    (2 min) Markov chains with variable length are useful parsimonious stochastic models able to generate most stationary sequence of discrete symbols. The idea is to identify the suffixes of the past, called contexts, that are relevant to predict the future symbol. Sometimes a single state is a context, and looking at the past and finding this specific state makes the further past irrelevant. States with such property are called renewal states and they can be used to split the chain into independent and identically distributed blocks. In order to identify renewal states for chains with variable length, we propose the use of Intrinsic Bayes Factor to evaluate the hypothesis that some particular state is a renewal state. In this case, the difficulty lies in integrating the marginal posterior distribution for the random context trees for general prior distribution on the space of context trees, with Dirichlet prior for the transition probabilities, and Monte Carlo methods are applied. To show the strength of our method, we analyzed artificial datasets generated from different binary models models and one example coming from the field of Linguistics.
    ITSA: An Information-Theoretic Approach to Automatic Shortcut Avoidance and Domain Generalization in Stereo Matching Networks. (arXiv:2201.02263v1 [cs.CV])
    (2 min) State-of-the-art stereo matching networks trained only on synthetic data often fail to generalize to more challenging real data domains. In this paper, we attempt to unfold an important factor that hinders the networks from generalizing across domains: through the lens of shortcut learning. We demonstrate that the learning of feature representations in stereo matching networks is heavily influenced by synthetic data artefacts (shortcut attributes). To mitigate this issue, we propose an Information-Theoretic Shortcut Avoidance~(ITSA) approach to automatically restrict shortcut-related information from being encoded into the feature representations. As a result, our proposed method learns robust and shortcut-invariant features by minimizing the sensitivity of latent features to input variations. To avoid the prohibitive computational cost of direct input sensitivity optimization, we propose an effective yet feasible algorithm to achieve robustness. We show that using this method, state-of-the-art stereo matching networks that are trained purely on synthetic data can effectively generalize to challenging and previously unseen real data scenarios. Importantly, the proposed method enhances the robustness of the synthetic trained networks to the point that they outperform their fine-tuned counterparts (on real data) for challenging out-of-domain stereo datasets.
    GCWSNet: Generalized Consistent Weighted Sampling for Scalable and Accurate Training of Neural Networks. (arXiv:2201.02283v1 [stat.ML])
    (2 min) We develop the "generalized consistent weighted sampling" (GCWS) for hashing the "powered-GMM" (pGMM) kernel (with a tuning parameter $p$). It turns out that GCWS provides a numerically stable scheme for applying power transformation on the original data, regardless of the magnitude of $p$ and the data. The power transformation is often effective for boosting the performance, in many cases considerably so. We feed the hashed data to neural networks on a variety of public classification datasets and name our method ``GCWSNet''. Our extensive experiments show that GCWSNet often improves the classification accuracy. Furthermore, it is evident from the experiments that GCWSNet converges substantially faster. In fact, GCWS often reaches a reasonable accuracy with merely (less than) one epoch of the training process. This property is much desired because many applications, such as advertisement click-through rate (CTR) prediction models, or data streams (i.e., data seen only once), often train just one epoch. Another beneficial side effect is that the computations of the first layer of the neural networks become additions instead of multiplications because the input data become binary (and highly sparse). Empirical comparisons with (normalized) random Fourier features (NRFF) are provided. We also propose to reduce the model size of GCWSNet by count-sketch and develop the theory for analyzing the impact of using count-sketch on the accuracy of GCWS. Our analysis shows that an ``8-bit'' strategy should work well in that we can always apply an 8-bit count-sketch hashing on the output of GCWS hashing without hurting the accuracy much. There are many other ways to take advantage of GCWS when training deep neural networks. For example, one can apply GCWS on the outputs of the last layer to boost the accuracy of trained deep neural networks.
    Sparse PCA on fixed-rank matrices. (arXiv:2201.02487v1 [cs.LG])
    (2 min) Sparse PCA is the optimization problem obtained from PCA by adding a sparsity constraint on the principal components. Sparse PCA is NP-hard and hard to approximate even in the single-component case. In this paper we settle the computational complexity of sparse PCA with respect to the rank of the covariance matrix. We show that, if the rank of the covariance matrix is a fixed value, then there is an algorithm that solves sparse PCA to global optimality, whose running time is polynomial in the number of features. We also prove a similar result for the version of sparse PCA which requires the principal components to have disjoint supports.
    Causal Mediation Analysis with Hidden Confounders. (arXiv:2102.11724v3 [stat.ML] UPDATED)
    (2 min) An important problem in causal inference is to break down the total effect of a treatment on an outcome into different causal pathways and to quantify the causal effect in each pathway. For instance, in causal fairness, the total effect of being a male employee (i.e., treatment) constitutes its direct effect on annual income (i.e., outcome) and the indirect effect via the employee's occupation (i.e., mediator). Causal mediation analysis (CMA) is a formal statistical framework commonly used to reveal such underlying causal mechanisms. One major challenge of CMA in observational studies is handling confounders, variables that cause spurious causal relationships among treatment, mediator, and outcome. Conventional methods assume sequential ignorability that implies all confounders can be measured, which is often unverifiable in practice. This work aims to circumvent the stringent sequential ignorability assumptions and consider hidden confounders. Drawing upon proxy strategies and recent advances in deep learning, we propose to simultaneously uncover the latent variables that characterize hidden confounders and estimate the causal effects. Empirical evaluations using both synthetic and semi-synthetic datasets validate the effectiveness of the proposed method. We further show the potentials of our approach for causal fairness analysis.
    Learning to Transfer with von Neumann Conditional Divergence. (arXiv:2108.03531v2 [cs.LG] UPDATED)
    (2 min) The similarity of feature representations plays a pivotal role in the success of problems related to domain adaptation. Feature similarity includes both the invariance of marginal distributions and the closeness of conditional distributions given the desired response $y$ (e.g., class labels). Unfortunately, traditional methods always learn such features without fully taking into consideration the information in $y$, which in turn may lead to a mismatch of the conditional distributions or the mix-up of discriminative structures underlying data distributions. In this work, we introduce the recently proposed von Neumann conditional divergence to improve the transferability across multiple domains. We show that this new divergence is differentiable and eligible to easily quantify the functional dependence between features and $y$. Given multiple source tasks, we integrate this divergence to capture discriminative information in $y$ and design novel learning objectives assuming those source tasks are observed either simultaneously or sequentially. In both scenarios, we obtain favorable performance against state-of-the-art methods in terms of smaller generalization error on new tasks and less catastrophic forgetting on source tasks (in the sequential setup).
    Time Series Forecasting Using Fuzzy Cognitive Maps: A Survey. (arXiv:2201.02297v1 [cs.AI])
    (0 min) Among various soft computing approaches for time series forecasting, Fuzzy Cognitive Maps (FCM) have shown remarkable results as a tool to model and analyze the dynamics of complex systems. FCM have similarities to recurrent neural networks and can be classified as a neuro-fuzzy method. In other words, FCMs are a mixture of fuzzy logic, neural network, and expert system aspects, which act as a powerful tool for simulating and studying the dynamic behavior of complex systems. The most interesting features are knowledge interpretability, dynamic characteristics and learning capability. The goal of this survey paper is mainly to present an overview on the most relevant and recent FCM-based time series forecasting models proposed in the literature. In addition, this article considers an introduction on the fundamentals of FCM model and learning methodologies. Also, this survey provides some ideas for future research to enhance the capabilities of FCM in order to cover some challenges in the real-world experiments such as handling non-stationary data and scalability issues. Moreover, equipping FCMs with fast learning algorithms is one of the major concerns in this area.
    BDFA: A Blind Data Adversarial Bit-flip Attack on Deep Neural Networks. (arXiv:2112.03477v2 [cs.CR] UPDATED)
    (2 min) Adversarial bit-flip attack (BFA) on Neural Network weights can result in catastrophic accuracy degradation by flipping a very small number of bits. A major drawback of prior bit flip attack techniques is their reliance on test data. This is frequently not possible for applications that contain sensitive or proprietary data. In this paper, we propose Blind Data Adversarial Bit-flip Attack (BDFA), a novel technique to enable BFA without any access to the training or testing data. This is achieved by optimizing for a synthetic dataset, which is engineered to match the statistics of batch normalization across different layers of the network and the targeted label. Experimental results show that BDFA could decrease the accuracy of ResNet50 significantly from 75.96\% to 13.94\% with only 4 bits flips.
    Bayesian Neural Networks for Reversible Steganography. (arXiv:2201.02478v1 [cs.LG])
    (2 min) Recent advances in deep learning have led to a paradigm shift in reversible steganography. A fundamental pillar of reversible steganography is predictive modelling which can be realised via deep neural networks. However, non-trivial errors exist in inferences about some out-of-distribution and noisy data. In view of this issue, we propose to consider uncertainty in predictive models based upon a theoretical framework of Bayesian deep learning. Bayesian neural networks can be regarded as self-aware machinery; that is, a machine that knows its own limitations. To quantify uncertainty, we approximate the posterior predictive distribution through Monte Carlo sampling with stochastic forward passes. We further show that predictive uncertainty can be disentangled into aleatoric and epistemic uncertainties and these quantities can be learnt in an unsupervised manner. Experimental results demonstrate an improvement delivered by Bayesian uncertainty analysis upon steganographic capacity-distortion performance.
    Similarities and Differences between Machine Learning and Traditional Advanced Statistical Modeling in Healthcare Analytics. (arXiv:2201.02469v1 [cs.LG])
    (2 min) Data scientists and statisticians are often at odds when determining the best approach, machine learning or statistical modeling, to solve an analytics challenge. However, machine learning and statistical modeling are more cousins than adversaries on different sides of an analysis battleground. Choosing between the two approaches or in some cases using both is based on the problem to be solved and outcomes required as well as the data available for use and circumstances of the analysis. Machine learning and statistical modeling are complementary, based on similar mathematical principles, but simply using different tools in an overall analytics knowledge base. Determining the predominant approach should be based on the problem to be solved as well as empirical evidence, such as size and completeness of the data, number of variables, assumptions or lack thereof, and expected outcomes such as predictions or causality. Good analysts and data scientists should be well versed in both techniques and their proper application, thereby using the right tool for the right project to achieve the desired results.
    Principal Component Density Estimation for Scenario Generation Using Normalizing Flows. (arXiv:2104.10410v3 [cs.LG] UPDATED)
    (2 min) Neural networks-based learning of the distribution of non-dispatchable renewable electricity generation from sources such as photovoltaics (PV) and wind as well as load demands has recently gained attention. Normalizing flow density models are particularly well suited for this task due to the training through direct log-likelihood maximization. However, research from the field of image generation has shown that standard normalizing flows can only learn smeared-out versions of manifold distributions. Previous works on normalizing flow-based scenario generation do not address this issue, and the smeared-out distributions result in the sampling of noisy time series. In this paper, we exploit the isometry of the principal component analysis (PCA), which sets up the normalizing flow in a lower-dimensional space while maintaining the direct and computationally efficient likelihood maximization. We train the resulting principal component flow (PCF) on data of PV and wind power generation as well as load demand in Germany in the years 2013 to 2015. The results of this investigation show that the PCF preserves critical features of the original distributions, such as the probability density and frequency behavior of the time series. The application of the PCF is, however, not limited to renewable power generation but rather extends to any data set, time series, or otherwise, which can be efficiently reduced using PCA.
    PAC-Bayesian Matrix Completion with a Spectral Scaled Student Prior. (arXiv:2104.08191v2 [stat.ML] UPDATED)
    (2 min) We study the problem of matrix completion in this paper. A spectral scaled Student prior is exploited to favour the underlying low-rank structure of the data matrix. We provide a thorough theoretical investigation for our approach through PAC-Bayesian bounds. More precisely, our PAC-Bayesian approach enjoys a minimax-optimal oracle inequality which guarantees that our method works well under model misspecification and under general sampling distribution. Interestingly, we also provide efficient gradient-based sampling implementations for our approach by using Langevin Monte Carlo. More specifically, we show that our algorithms are significantly faster than Gibbs sampler in this problem. To illustrate the attractive features of our inference strategy, some numerical simulations are conducted and an application to image inpainting is demonstrated.
    Robust Image Retrieval-based Visual Localization using Kapture. (arXiv:2007.13867v3 [cs.CV] UPDATED)
    (2 min) Visual localization tackles the challenge of estimating the camera pose from images by using correspondence analysis between query images and a map. This task is computation and data intensive which poses challenges on thorough evaluation of methods on various datasets. However, in order to further advance in the field, we claim that robust visual localization algorithms should be evaluated on multiple datasets covering a broad domain variety. To facilitate this, we introduce kapture, a new, flexible, unified data format and toolbox for visual localization and structure-from-motion (SFM). It enables easy usage of different datasets as well as efficient and reusable data processing. To demonstrate this, we present a versatile pipeline for visual localization that facilitates the use of different local and global features, 3D data (e.g. depth maps), non-vision sensor data (e.g. IMU, GPS, WiFi), and various processing algorithms. Using multiple configurations of the pipeline, we show the great versatility of kapture in our experiments. Furthermore, we evaluate our methods on eight public datasets where they rank top on all and first on many of them. To foster future research, we release code, models, and all datasets used in this paper in the kapture format open source under a permissive BSD license. github.com/naver/kapture, github.com/naver/kapture-localization
    MIN2Net: End-to-End Multi-Task Learning for Subject-Independent Motor Imagery EEG Classification. (arXiv:2102.03814v4 [eess.SP] UPDATED)
    (2 min) Advances in the motor imagery (MI)-based brain-computer interfaces (BCIs) allow control of several applications by decoding neurophysiological phenomena, which are usually recorded by electroencephalography (EEG) using a non-invasive technique. Despite great advances in MI-based BCI, EEG rhythms are specific to a subject and various changes over time. These issues point to significant challenges to enhance the classification performance, especially in a subject-independent manner. To overcome these challenges, we propose MIN2Net, a novel end-to-end multi-task learning to tackle this task. We integrate deep metric learning into a multi-task autoencoder to learn a compact and discriminative latent representation from EEG and perform classification simultaneously. This approach reduces the complexity in pre-processing, results in significant performance improvement on EEG classification. Experimental results in a subject-independent manner show that MIN2Net outperforms the state-of-the-art techniques, achieving an F1-score improvement of 6.72%, and 2.23% on the SMR-BCI, and OpenBMI datasets, respectively. We demonstrate that MIN2Net improves discriminative information in the latent representation. This study indicates the possibility and practicality of using this model to develop MI-based BCI applications for new users without the need for calibration.
    Auction-Based Ex-Post-Payment Incentive Mechanism Design for Horizontal Federated Learning with Reputation and Contribution Measurement. (arXiv:2201.02410v1 [cs.AI])
    (2 min) Federated learning trains models across devices with distributed data, while protecting the privacy and obtaining a model similar to that of centralized ML. A large number of workers with data and computing power are the foundation of federal learning. However, the inevitable costs prevent self-interested workers from serving for free. Moreover, due to data isolation, task publishers lack effective methods to select, evaluate and pay reliable workers with high-quality data. Therefore, we design an auction-based incentive mechanism for horizontal federated learning with reputation and contribution measurement. By designing a reasonable method of measuring contribution, we establish the reputation of workers, which is easy to decline and difficult to improve. Through reverse auctions, workers bid for tasks, and the task publisher selects workers combining reputation and bid price. With the budget constraint, winning workers are paid based on performance. We proved that our mechanism satisfies the individual rationality of the honest worker, budget feasibility, truthfulness, and computational efficiency.
    Conformer-Based Self-Supervised Learning for Non-Speech Audio Tasks. (arXiv:2110.07313v3 [cs.SD] UPDATED)
    (2 min) Representation learning from unlabeled data has been of major interest in artificial intelligence research. While self-supervised speech representation learning has been popular in the speech research community, very few works have comprehensively analyzed audio representation learning for non-speech audio tasks. In this paper, we propose a self-supervised audio representation learning method and apply it to a variety of downstream non-speech audio tasks. We combine the well-known wav2vec 2.0 framework, which has shown success in self-supervised learning for speech tasks, with parameter-efficient conformer architectures. Our self-supervised pre-training can reduce the need for labeled data by two-thirds. On the AudioSet benchmark, we achieve a mean average precision (mAP) score of 0.415, which is a new state-of-the-art on this dataset through audio-only self-supervised learning. Our fine-tuned conformers also surpass or match the performance of previous systems pre-trained in a supervised way on several downstream tasks. We further discuss the important design considerations for both pre-training and fine-tuning.
    Programmatic Reward Design by Example. (arXiv:2112.08438v2 [cs.LG] UPDATED)
    (2 min) Reward design is a fundamental problem in reinforcement learning (RL). A misspecified or poorly designed reward can result in low sample efficiency and undesired behaviors. In this paper, we propose the idea of programmatic reward design, i.e. using programs to specify the reward functions in RL environments. Programs allow human engineers to express sub-goals and complex task scenarios in a structured and interpretable way. The challenge of programmatic reward design, however, is that while humans can provide the high-level structures, properly setting the low-level details, such as the right amount of reward for a specific sub-task, remains difficult. A major contribution of this paper is a probabilistic framework that can infer the best candidate programmatic reward function from expert demonstrations. Inspired by recent generative-adversarial approaches, our framework searches for the most likely programmatic reward function under which the optimally generated trajectories cannot be differentiated from the demonstrated trajectories. Experimental results show that programmatic reward functionslearned using this framework can significantly outperform those learned using existing reward learning algo-rithms, and enable RL agents to achieve state-of-the-artperformance on highly complex tasks.
    Offline Reinforcement Learning for Road Traffic Control. (arXiv:2201.02381v1 [cs.AI])
    (2 min) Traffic signal control is an important problem in urban mobility with a significant potential of economic and environmental impact. While there is a growing interest in Reinforcement Learning (RL) for traffic control, the work so far has focussed on learning through interactions which, in practice, is costly. Instead, real experience data on traffic is available and could be exploited at minimal costs. Recent progress in offline or batch RL has enabled just that. Model-based offline RL methods, in particular, have been shown to generalize to the experience data much better than others. We build a model-based learning framework, A-DAC, which infers a Markov Decision Process (MDP) from dataset with pessimistic costs built in to deal with data uncertainties. The costs are modeled through an adaptive shaping of rewards in the MDP which provides better regularization of data compared to the prior related work. A-DAC is evaluated on a complex signalized roundabout using multiple datasets varying in size and in batch collection policy. The evaluation results show that it is possible to build high performance control policies in a data efficient manner using simplistic batch collection policies.
    Explainable deep learning for insights in El Nino and river flows. (arXiv:2201.02596v1 [physics.ao-ph])
    (2 min) The El Nino Southern Oscillation (ENSO) is a semi-periodic fluctuation in sea surface temperature (SST) over the tropical central and eastern Pacific Ocean that influences interannual variability in regional hydrology across the world through long-range dependence or teleconnections. Recent research has demonstrated the value of Deep Learning (DL) methods for improving ENSO prediction as well as Complex Networks (CN) for understanding teleconnections. However, gaps in predictive understanding of ENSO-driven river flows include the black box nature of DL, the use of simple ENSO indices to describe a complex phenomenon and translating DL-based ENSO predictions to river flow predictions. Here we show that eXplainable DL (XDL) methods, based on saliency maps, can extract interpretable predictive information contained in global SST and discover novel SST information regions and dependence structures relevant for river flows which, in tandem with climate network constructions, enable improved predictive understanding. Our results reveal additional information content in global SST beyond ENSO indices, develop new understanding of how SSTs influence river flows, and generate improved river flow predictions with uncertainties. Observations, reanalysis data, and earth system model simulations are used to demonstrate the value of the XDL-CN based methods for future interannual and decadal scale climate projections.
    Forecasting emissions through Kaya identity using Neural Ordinary Differential Equations. (arXiv:2201.02433v1 [cs.LG])
    (2 min) Starting from the Kaya identity, we used a Neural ODE model to predict the evolution of several indicators related to carbon emissions, on a country-level: population, GDP per capita, energy intensity of GDP, carbon intensity of energy. We compared the model with a baseline statistical model - VAR - and obtained good performances. We conclude that this machine-learning approach can be used to produce a wide range of results and give relevant insight to policymakers
    Churn prediction in online gambling. (arXiv:2201.02463v1 [cs.LG])
    (2 min) In business retention, churn prevention has always been a major concern. This work contributes to this domain by formalizing the problem of churn prediction in the context of online gambling as a binary classification task. We also propose an algorithmic answer to this problem based on recurrent neural network. This algorithm is tested with online gambling data that have the form of time series, which can be efficiently processed by recurrent neural networks. To evaluate the performances of the trained models, standard machine learning metrics were used, such as accuracy, precision and recall. For this problem in particular, the conducted experiments allowed to assess that the choice of a specific architecture depends on the metric which is given the greatest importance. Architectures using nBRC favour precision, those using LSTM give better recall, while GRU-based architectures allow a higher accuracy and balance two other metrics. Moreover, further experiments showed that using only the more recent time-series histories to train the networks decreases the quality of the results. We also study the performances of models learned at a specific instant $t$, at other times $t^{\prime} > t$. The results show that the performances of the models learned at time $t$ remain good at the following instants $t^{\prime} > t$, suggesting that there is no need to refresh the models at a high rate. However, the performances of the models were subject to noticeable variance due to one-off events impacting the data.
    Linear classifier, least-squares cost function, and outliers. (arXiv:1808.09222v2 [physics.data-an] UPDATED)
    (2 min) A set of introductory notes on the subject of data classification using a linear classifier and least-squares cost function, and the negative effect of the presence of outliers on the decision boundary of the linear discriminant. We also show how a simple scaling could make the outlier less significant, thereby obtaining a much better decision boundary. We present some numerical results.
    iDECODe: In-distribution Equivariance for Conformal Out-of-distribution Detection. (arXiv:2201.02331v1 [cs.LG])
    (2 min) Machine learning methods such as deep neural networks (DNNs), despite their success across different domains, are known to often generate incorrect predictions with high confidence on inputs outside their training distribution. The deployment of DNNs in safety-critical domains requires detection of out-of-distribution (OOD) data so that DNNs can abstain from making predictions on those. A number of methods have been recently developed for OOD detection, but there is still room for improvement. We propose the new method iDECODe, leveraging in-distribution equivariance for conformal OOD detection. It relies on a novel base non-conformity measure and a new aggregation method, used in the inductive conformal anomaly detection framework, thereby guaranteeing a bounded false detection rate. We demonstrate the efficacy of iDECODe by experiments on image and audio datasets, obtaining state-of-the-art results. We also show that iDECODe can detect adversarial examples.
    DiffusionNet: Discretization Agnostic Learning on Surfaces. (arXiv:2012.00888v3 [cs.CV] UPDATED)
    (2 min) We introduce a new general-purpose approach to deep learning on 3D surfaces, based on the insight that a simple diffusion layer is highly effective for spatial communication. The resulting networks are automatically robust to changes in resolution and sampling of a surface -- a basic property which is crucial for practical applications. Our networks can be discretized on various geometric representations such as triangle meshes or point clouds, and can even be trained on one representation then applied to another. We optimize the spatial support of diffusion as a continuous network parameter ranging from purely local to totally global, removing the burden of manually choosing neighborhood sizes. The only other ingredients in the method are a multi-layer perceptron applied independently at each point, and spatial gradient features to support directional filters. The resulting networks are simple, robust, and efficient. Here, we focus primarily on triangle mesh surfaces, and demonstrate state-of-the-art results for a variety of tasks including surface classification, segmentation, and non-rigid correspondence.
    FedDQ: Communication-Efficient Federated Learning with Descending Quantization. (arXiv:2110.02291v3 [cs.LG] UPDATED)
    (2 min) Federated learning (FL) is an emerging privacy-preserving distributed learning scheme. Due to the large model size and frequent model aggregation, FL suffers from critical communication bottleneck. Many techniques have been proposed to reduce the communication volume, including model compression and quantization. Existing adaptive quantization schemes use ascending-trend quantization where the quantizaion level increases with the training stages. In this paper, we formulate the problem as optimizing the training convergence rate for a given communication volume. The result shows that the optimal quantizaiton level can be represented by two factors, i.e., the training loss and the range of model updates, and it is preferable to decrease the quantization level rather than increase. Then, we propose two descending quantization schemes based on the training loss and model range. Experimental results show that proposed schemes not only reduce the communication volume but also help FL converge faster, when compared with current ascending quantization.
    Visual Attention Prediction Improves Performance of Autonomous Drone Racing Agents. (arXiv:2201.02569v1 [cs.RO])
    (2 min) Humans race drones faster than neural networks trained for end-to-end autonomous flight. This may be related to the ability of human pilots to select task-relevant visual information effectively. This work investigates whether neural networks capable of imitating human eye gaze behavior and attention can improve neural network performance for the challenging task of vision-based autonomous drone racing. We hypothesize that gaze-based attention prediction can be an efficient mechanism for visual information selection and decision making in a simulator-based drone racing task. We test this hypothesis using eye gaze and flight trajectory data from 18 human drone pilots to train a visual attention prediction model. We then use this visual attention prediction model to train an end-to-end controller for vision-based autonomous drone racing using imitation learning. We compare the drone racing performance of the attention-prediction controller to those using raw image inputs and image-based abstractions (i.e., feature tracks). Our results show that attention-prediction based controllers outperform the baselines and are able to complete a challenging race track consistently with up to 88% success rate. Furthermore, visual attention-prediction and feature-track based models showed better generalization performance than image-based models when evaluated on hold-out reference trajectories. Our results demonstrate that human visual attention prediction improves the performance of autonomous vision-based drone racing agents and provides an essential step towards vision-based, fast, and agile autonomous flight that eventually can reach and even exceed human performances.
    Assigning function to protein-protein interactions: a weakly supervised BioBERT based approach using PubMed abstracts. (arXiv:2008.08727v3 [cs.CL] UPDATED)
    (2 min) Motivation: Protein-protein interactions (PPI) are critical to the function of proteins in both normal and diseased cells, and many critical protein functions are mediated by interactions.Knowledge of the nature of these interactions is important for the construction of networks to analyse biological data. However, only a small percentage of PPIs captured in protein interaction databases have annotations of function available, e.g. only 4% of PPI are functionally annotated in the IntAct database. Here, we aim to label the function type of PPIs by extracting relationships described in PubMed abstracts. Method: We create a weakly supervised dataset from the IntAct PPI database containing interacting protein pairs with annotated function and associated abstracts from the PubMed database. We apply a state-of-the-art deep learning technique for biomedical natural language processing tasks, BioBERT, to build a model - dubbed PPI-BioBERT - for identifying the function of PPIs. In order to extract high quality PPI functions at large scale, we use an ensemble of PPI-BioBERT models to improve uncertainty estimation and apply an interaction type-specific threshold to counteract the effects of variations in the number of training samples per interaction type. Results: We scan 18 million PubMed abstracts to automatically identify 3253 new typed PPIs, including phosphorylation and acetylation interactions, with an overall precision of 46% (87% for acetylation) based on a human-reviewed sample. This work demonstrates that analysis of biomedical abstracts for PPI function extraction is a feasible approach to substantially increasing the number of interactions annotated with function captured in online databases.
    Neural Network Optimization for Reinforcement Learning Tasks Using Sparse Computations. (arXiv:2201.02571v1 [cs.LG])
    (2 min) This article proposes a sparse computation-based method for optimizing neural networks for reinforcement learning (RL) tasks. This method combines two ideas: neural network pruning and taking into account input data correlations; it makes it possible to update neuron states only when changes in them exceed a certain threshold. It significantly reduces the number of multiplications when running neural networks. We tested different RL tasks and achieved 20-150x reduction in the number of multiplications. There were no substantial performance losses; sometimes the performance even improved.
    Introducing Self-Attention to Target Attentive Graph Neural Networks. (arXiv:2107.01516v3 [cs.IR] UPDATED)
    (2 min) Session-based recommendation systems suggest relevant items to users by modeling user behavior and preferences using short-term anonymous sessions. Existing methods leverage Graph Neural Networks (GNNs) that propagate and aggregate information from neighboring nodes i.e., local message passing. Such graph-based architectures have representational limits, as a single sub-graph is susceptible to overfit the sequential dependencies instead of accounting for complex transitions between items in different sessions. We propose a new technique that leverages a Transformer in combination with a target attentive GNN. This allows richer representations to be learnt, which translates to empirical performance gains in comparison to a vanilla target attentive GNN. Our experimental results and ablation show that our proposed method is competitive with the existing methods on real-world benchmark datasets, improving on graph-based hypotheses. Code is available at https://github.com/The-Learning-Machines/SBR
    Learning to be adversarially robust and differentially private. (arXiv:2201.02265v1 [cs.LG])
    (2 min) We study the difficulties in learning that arise from robust and differentially private optimization. We first study convergence of gradient descent based adversarial training with differential privacy, taking a simple binary classification task on linearly separable data as an illustrative example. We compare the gap between adversarial and nominal risk in both private and non-private settings, showing that the data dimensionality dependent term introduced by private optimization compounds the difficulties of learning a robust model. After this, we discuss what parts of adversarial training and differential privacy hurt optimization, identifying that the size of adversarial perturbation and clipping norm in differential privacy both increase the curvature of the loss landscape, implying poorer generalization performance.
    Unsupervised Multimodal Language Representations using Convolutional Autoencoders. (arXiv:2110.03007v2 [cs.CL] UPDATED)
    (2 min) Multimodal Language Analysis is a demanding area of research, since it is associated with two requirements: combining different modalities and capturing temporal information. During the last years, several works have been proposed in the area, mostly centered around supervised learning in downstream tasks. In this paper we propose extracting unsupervised Multimodal Language representations that are universal and can be applied to different tasks. Towards this end, we map the word-level aligned multimodal sequences to 2-D matrices and then use Convolutional Autoencoders to learn embeddings by combining multiple datasets. Extensive experimentation on Sentiment Analysis (MOSEI) and Emotion Recognition (IEMOCAP) indicate that the learned representations can achieve near-state-of-the-art performance with just the use of a Logistic Regression algorithm for downstream classification. It is also shown that our method is extremely lightweight and can be easily generalized to other tasks and unseen data with small performance drop and almost the same number of parameters. The proposed multimodal representation models are open-sourced and will help grow the applicability of Multimodal Language.
    Universal Paralinguistic Speech Representations Using Self-Supervised Conformers. (arXiv:2110.04621v2 [cs.SD] UPDATED)
    (2 min) Many speech applications require understanding aspects beyond the words being spoken, such as recognizing emotion, detecting whether the speaker is wearing a mask, or distinguishing real from synthetic speech. In this work, we introduce a new state-of-the-art paralinguistic representation derived from large-scale, fully self-supervised training of a 600M+ parameter Conformer-based architecture. We benchmark on a diverse set of speech tasks and demonstrate that simple linear classifiers trained on top of our time-averaged representation outperform nearly all previous results, in some cases by large margins. Our analyses of context-window size demonstrate that, surprisingly, 2 second context-windows achieve 96\% the performance of the Conformers that use the full long-term context on 7 out of 9 tasks. Furthermore, while the best per-task representations are extracted internally in the network, stable performance across several layers allows a single universal representation to reach near optimal performance on all tasks.
    Mirror Learning: A Unifying Framework of Policy Optimisation. (arXiv:2201.02373v1 [cs.LG])
    (2 min) General policy improvement (GPI) and trust-region learning (TRL) are the predominant frameworks within contemporary reinforcement learning (RL), which serve as the core models for solving Markov decision processes (MDPs). Unfortunately, in their mathematical form, they are sensitive to modifications, and thus, the practical instantiations that implement them do not automatically inherit their improvement guarantees. As a result, the spectrum of available rigorous MDP-solvers is narrow. Indeed, many state-of-the-art (SOTA) algorithms, such as TRPO and PPO, are not proven to converge. In this paper, we propose \textsl{mirror learning} -- a general solution to the RL problem. We reveal GPI and TRL to be but small points within this far greater space of algorithms which boasts the monotonic improvement property and converges to the optimal policy. We show that virtually all SOTA algorithms for RL are instances of mirror learning, and thus suggest that their empirical performance is a consequence of their theoretical properties, rather than of approximate analogies. Excitingly, we show that mirror learning opens up a whole new space of policy learning methods with convergence guarantees.
    Bayesian Online Change Point Detection for Baseline Shifts. (arXiv:2201.02325v1 [stat.ML])
    (2 min) In time series data analysis, detecting change points on a real-time basis (online) is of great interest in many areas, such as finance, environmental monitoring, and medicine. One promising means to achieve this is the Bayesian online change point detection (BOCPD) algorithm, which has been successfully adopted in particular cases in which the time series of interest has a fixed baseline. However, we have found that the algorithm struggles when the baseline irreversibly shifts from its initial state. This is because with the original BOCPD algorithm, the sensitivity with which a change point can be detected is degraded if the data points are fluctuating at locations relatively far from the original baseline. In this paper, we not only extend the original BOCPD algorithm to be applicable to a time series whose baseline is constantly shifting toward unknown values but also visualize why the proposed extension works. To demonstrate the efficacy of the proposed algorithm compared to the original one, we examine these algorithms on two real-world data sets and six synthetic data sets.
    GenLabel: Mixup Relabeling using Generative Models. (arXiv:2201.02354v1 [cs.LG])
    (2 min) Mixup is a data augmentation method that generates new data points by mixing a pair of input data. While mixup generally improves the prediction performance, it sometimes degrades the performance. In this paper, we first identify the main causes of this phenomenon by theoretically and empirically analyzing the mixup algorithm. To resolve this, we propose GenLabel, a simple yet effective relabeling algorithm designed for mixup. In particular, GenLabel helps the mixup algorithm correctly label mixup samples by learning the class-conditional data distribution using generative models. Via extensive theoretical and empirical analysis, we show that mixup, when used together with GenLabel, can effectively resolve the aforementioned phenomenon, improving the generalization performance and the adversarial robustness.
    3D Intracranial Aneurysm Classification and Segmentation via Unsupervised Dual-branch Learning. (arXiv:2201.02198v1 [eess.IV])
    (2 min) Intracranial aneurysms are common nowadays and how to detect them intelligently is of great significance in digital health. While most existing deep learning research focused on medical images in a supervised way, we introduce an unsupervised method for the detection of intracranial aneurysms based on 3D point cloud data. In particular, our method consists of two stages: unsupervised pre-training and downstream tasks. As for the former, the main idea is to pair each point cloud with its jittered counterpart and maximise their correspondence. Then we design a dual-branch contrastive network with an encoder for each branch and a subsequent common projection head. As for the latter, we design simple networks for supervised classification and segmentation training. Experiments on the public dataset (IntrA) show that our unsupervised method achieves comparable or even better performance than some state-of-the-art supervised techniques, and it is most prominent in the detection of aneurysmal vessels. Experiments on the ModelNet40 also show that our method achieves the accuracy of 90.79\% which outperforms existing state-of-the-art unsupervised models.
    k-Center Clustering with Outliers in Sliding Windows. (arXiv:2201.02448v1 [cs.LG])
    (2 min) Metric $k$-center clustering is a fundamental unsupervised learning primitive. Although widely used, this primitive is heavily affected by noise in the data, so that a more sensible variant seeks for the best solution that disregards a given number $z$ of points of the dataset, called outliers. We provide efficient algorithms for this important variant in the streaming model under the sliding window setting, where, at each time step, the dataset to be clustered is the window $W$ of the most recent data items. Our algorithms achieve $O(1)$ approximation and, remarkably, require a working memory linear in $k+z$ and only logarithmic in $|W|$. As a by-product, we show how to estimate the effective diameter of the window $W$, which is a measure of the spread of the window points, disregarding a given fraction of noisy distances. We also provide experimental evidence of the practical viability of our theoretical results.
    Audiomer: A Convolutional Transformer For Keyword Spotting. (arXiv:2109.10252v3 [cs.LG] UPDATED)
    (2 min) Transformers have seen an unprecedented rise in Natural Language Processing and Computer Vision tasks. However, in audio tasks, they are either infeasible to train due to extremely large sequence length of audio waveforms or incur a performance penalty when trained on Fourier-based features. In this work, we introduce an architecture, Audiomer, where we combine 1D Residual Networks with Performer Attention to achieve state-of-the-art performance in keyword spotting with raw audio waveforms, outperforming all previous methods while being computationally cheaper and parameter-efficient. Additionally, our model has practical advantages for speech processing, such as inference on arbitrarily long audio clips owing to the absence of positional encoding. The code is available at https://github.com/The-Learning-Machines/Audiomer-PyTorch.
    Spatial-Temporal Sequential Hypergraph Network for Crime Prediction. (arXiv:2201.02435v1 [cs.LG])
    (2 min) Crime prediction is crucial for public safety and resource optimization, yet is very challenging due to two aspects: i) the dynamics of criminal patterns across time and space, crime events are distributed unevenly on both spatial and temporal domains; ii) time-evolving dependencies between different types of crimes (e.g., Theft, Robbery, Assault, Damage) which reveal fine-grained semantics of crimes. To tackle these challenges, we propose Spatial-Temporal Sequential Hypergraph Network (ST-SHN) to collectively encode complex crime spatial-temporal patterns as well as the underlying category-wise crime semantic relationships. In specific, to handle spatial-temporal dynamics under the long-range and global context, we design a graph-structured message passing architecture with the integration of the hypergraph learning paradigm. To capture category-wise crime heterogeneous relations in a dynamic environment, we introduce a multi-channel routing mechanism to learn the time-evolving structural dependency across crime types. We conduct extensive experiments on two real-world datasets, showing that our proposed ST-SHN framework can significantly improve the prediction performance as compared to various state-of-the-art baselines. The source code is available at: https://github.com/akaxlh/ST-SHN.
    Exploring Adversarial Robustness of Multi-Sensor Perception Systems in Self Driving. (arXiv:2101.06784v3 [cs.CV] UPDATED)
    (2 min) Modern self-driving perception systems have been shown to improve upon processing complementary inputs such as LiDAR with images. In isolation, 2D images have been found to be extremely vulnerable to adversarial attacks. Yet, there have been limited studies on the adversarial robustness of multi-modal models that fuse LiDAR features with image features. Furthermore, existing works do not consider physically realizable perturbations that are consistent across the input modalities. In this paper, we showcase practical susceptibilities of multi-sensor detection by placing an adversarial object on top of a host vehicle. We focus on physically realizable and input-agnostic attacks as they are feasible to execute in practice, and show that a single universal adversary can hide different host vehicles from state-of-the-art multi-modal detectors. Our experiments demonstrate that successful attacks are primarily caused by easily corrupted image features. Furthermore, we find that in modern sensor fusion methods which project image features into 3D, adversarial attacks can exploit the projection process to generate false positives across distant regions in 3D. Towards more robust multi-modal perception systems, we show that adversarial training with feature denoising can boost robustness to such attacks significantly. However, we find that standard adversarial defenses still struggle to prevent false positives which are also caused by inaccurate associations between 3D LiDAR points and 2D pixels.
    Large-scale protein-protein post-translational modification extraction with distant supervision and confidence calibrated BioBERT. (arXiv:2201.02229v1 [cs.LG])
    (2 min) Protein-protein interactions (PPIs) are critical to normal cellular function and are related to many disease pathways. However, only 4% of PPIs are annotated with PTMs in biological knowledge databases such as IntAct, mainly performed through manual curation, which is neither time nor cost-effective. We use the IntAct PPI database to create a distant supervised dataset annotated with interacting protein pairs, their corresponding PTM type, and associated abstracts from the PubMed database. We train an ensemble of BioBERT models - dubbed PPI-BioBERT-x10 to improve confidence calibration. We extend the use of ensemble average confidence approach with confidence variation to counteract the effects of class imbalance to extract high confidence predictions. The PPI-BioBERT-x10 model evaluated on the test set resulted in a modest F1-micro 41.3 (P =5 8.1, R = 32.1). However, by combining high confidence and low variation to identify high quality predictions, tuning the predictions for precision, we retained 19% of the test predictions with 100% precision. We evaluated PPI-BioBERT-x10 on 18 million PubMed abstracts and extracted 1.6 million (546507 unique PTM-PPI triplets) PTM-PPI predictions, and filter ~ 5700 (4584 unique) high confidence predictions. Of the 5700, human evaluation on a small randomly sampled subset shows that the precision drops to 33.7% despite confidence calibration and highlights the challenges of generalisability beyond the test set even with confidence calibration. We circumvent the problem by only including predictions associated with multiple papers, improving the precision to 58.8%. In this work, we highlight the benefits and challenges of deep learning-based text mining in practice, and the need for increased emphasis on confidence calibration to facilitate human curation efforts.
    Contrastive Code Representation Learning. (arXiv:2007.04973v4 [cs.LG] UPDATED)
    (2 min) Recent work learns contextual representations of source code by reconstructing tokens from their context. For downstream semantic understanding tasks like summarizing code in English, these representations should ideally capture program functionality. However, we show that the popular reconstruction-based BERT model is sensitive to source code edits, even when the edits preserve semantics. We propose ContraCode: a contrastive pre-training task that learns code functionality, not form. ContraCode pre-trains a neural network to identify functionally similar variants of a program among many non-equivalent distractors. We scalably generate these variants using an automated source-to-source compiler as a form of data augmentation. Contrastive pre-training improves JavaScript summarization and TypeScript type inference accuracy by 2% to 13%. We also propose a new zero-shot JavaScript code clone detection dataset, showing that ContraCode is both more robust and semantically meaningful. On it, we outperform RoBERTa by 39% AUROC in an adversarial setting and up to 5% on natural code.
    Stochastic Saddle Point Problems with Decision-Dependent Distributions. (arXiv:2201.02313v1 [math.OC])
    (2 min) This paper focuses on stochastic saddle point problems with decision-dependent distributions in both the static and time-varying settings. These are problems whose objective is the expected value of a stochastic payoff function, where random variables are drawn from a distribution induced by a distributional map. For general distributional maps, the problem of finding saddle points is in general computationally burdensome, even if the distribution is known. To enable a tractable solution approach, we introduce the notion of equilibrium points -- which are saddle points for the stationary stochastic minimax problem that they induce -- and provide conditions for their existence and uniqueness. We demonstrate that the distance between the two classes of solutions is bounded provided that the objective has a strongly-convex-strongly-concave payoff and Lipschitz continuous distributional map. We develop deterministic and stochastic primal-dual algorithms and demonstrate their convergence to the equilibrium point. In particular, by modeling errors emerging from a stochastic gradient estimator as sub-Weibull random variables, we provide error bounds in expectation and in high probability that hold for each iteration; moreover, we show convergence to a neighborhood in expectation and almost surely. Finally, we investigate a condition on the distributional map -- which we call opposing mixture dominance -- that ensures the objective is strongly-convex-strongly-concave. Under this assumption, we show that primal-dual algorithms converge to the saddle points in a similar fashion.
    A Theoretical Framework of Almost Hyperparameter-free Hyperparameter Selection Methods for Offline Policy Evaluation. (arXiv:2201.02300v1 [stat.ML])
    (2 min) We are concerned with the problem of hyperparameter selection of offline policy evaluation (OPE). OPE is a key component of offline reinforcement learning, which is a core technology for data-driven decision optimization without environment simulators. However, the current state-of-the-art OPE methods are not hyperparameter-free, which undermines their utility in real-life applications. We address this issue by introducing a new approximate hyperparameter selection (AHS) framework for OPE, which defines a notion of optimality (called selection criteria) in a quantitative and interpretable manner without hyperparameters. We then derive four AHS methods each of which has different characteristics such as convergence rate and time complexity. Finally, we verify effectiveness and limitation of these methods with a preliminary experiment.
    Nonlocal Kernel Network (NKN): a Stable and Resolution-Independent Deep Neural Network. (arXiv:2201.02217v1 [cs.LG])
    (2 min) Neural operators have recently become popular tools for designing solution maps between function spaces in the form of neural networks. Differently from classical scientific machine learning approaches that learn parameters of a known partial differential equation (PDE) for a single instance of the input parameters at a fixed resolution, neural operators approximate the solution map of a family of PDEs. Despite their success, the uses of neural operators are so far restricted to relatively shallow neural networks and confined to learning hidden governing laws. In this work, we propose a novel nonlocal neural operator, which we refer to as nonlocal kernel network (NKN), that is resolution independent, characterized by deep neural networks, and capable of handling a variety of tasks such as learning governing equations and classifying images. Our NKN stems from the interpretation of the neural network as a discrete nonlocal diffusion reaction equation that, in the limit of infinite layers, is equivalent to a parabolic nonlocal equation, whose stability is analyzed via nonlocal vector calculus. The resemblance with integral forms of neural operators allows NKNs to capture long-range dependencies in the feature space, while the continuous treatment of node-to-node interactions makes NKNs resolution independent. The resemblance with neural ODEs, reinterpreted in a nonlocal sense, and the stable network dynamics between layers allow for generalization of NKN's optimal parameters from shallow to deep networks. This fact enables the use of shallow-to-deep initialization techniques. Our tests show that NKNs outperform baseline methods in both learning governing equations and image classification tasks and generalize well to different resolutions and depths.
    PWM2Vec: An Efficient Embedding Approach for Viral Host Specification from Coronavirus Spike Sequences. (arXiv:2201.02273v1 [q-bio.GN])
    (3 min) COVID-19 pandemic, is still unknown and is an important open question. There are speculations that bats are a possible origin. Likewise, there are many closely related (corona-) viruses, such as SARS, which was found to be transmitted through civets. The study of the different hosts which can be potential carriers and transmitters of deadly viruses to humans is crucial to understanding, mitigating and preventing current and future pandemics. In coronaviruses, the surface (S) protein, or spike protein, is an important part of determining host specificity since it is the point of contact between the virus and the host cell membrane. In this paper, we classify the hosts of over five thousand coronaviruses from their spike protein sequences, segregating them into clusters of distinct hosts among avians, bats, camels, swines, humans and weasels, to name a few. We propose a feature embedding based on the well-known position-weight matrix (PWM), which we call PWM2Vec, and use to generate feature vectors from the spike protein sequences of these coronaviruses. While our embedding is inspired by the success of PWMs in biological applications such as determining protein function, or identifying transcription factor binding sites, we are the first (to the best of our knowledge) to use PWMs in the context of host classification from viral sequences to generate a fixed-length feature vector representation. The results on the real world data show that in using PWM2Vec, we are able to perform comparably well as compared to baseline models. We also measure the importance of different amino acids using information gain to show the amino acids which are important for predicting the host of a given coronavirus.
    Stereoscopic Universal Perturbations across Different Architectures and Datasets. (arXiv:2112.06116v2 [cs.CV] UPDATED)
    (2 min) We study the effect of adversarial perturbations of images on deep stereo matching networks for the disparity estimation task. We present a method to craft a single set of perturbations that, when added to any stereo image pair in a dataset, can fool a stereo network to significantly alter the perceived scene geometry. Our perturbation images are "universal" in that they not only corrupt estimates of the network on the dataset they are optimized for, but also generalize to stereo networks with different architectures across different datasets. We evaluate our approach on multiple public benchmark datasets and show that our perturbations can increase D1-error (akin to fooling rate) of state-of-the-art stereo networks from 1% to as much as 87%. We investigate the effect of perturbations on the estimated scene geometry and identify object classes that are most vulnerable. Our analysis on the activations of registered points between left and right images led us to find that certain architectural components, i.e. deformable convolution and explicit matching, can increase robustness against adversaries. We demonstrate that by simply designing networks with such components, one can reduce the effect of adversaries by up to 60.5%, which rivals the robustness of networks fine-tuned with costly adversarial data augmentation.
    Leveraging Scale-Invariance and Uncertainity with Self-Supervised Domain Adaptation for Semantic Segmentation of Foggy Scenes. (arXiv:2201.02588v1 [cs.CV])
    (2 min) This paper presents FogAdapt, a novel approach for domain adaptation of semantic segmentation for dense foggy scenes. Although significant research has been directed to reduce the domain shift in semantic segmentation, adaptation to scenes with adverse weather conditions remains an open question. Large variations in the visibility of the scene due to weather conditions, such as fog, smog, and haze, exacerbate the domain shift, thus making unsupervised adaptation in such scenarios challenging. We propose a self-entropy and multi-scale information augmented self-supervised domain adaptation method (FogAdapt) to minimize the domain shift in foggy scenes segmentation. Supported by the empirical evidence that an increase in fog density results in high self-entropy for segmentation probabilities, we introduce a self-entropy based loss function to guide the adaptation method. Furthermore, inferences obtained at different image scales are combined and weighted by the uncertainty to generate scale-invariant pseudo-labels for the target domain. These scale-invariant pseudo-labels are robust to visibility and scale variations. We evaluate the proposed model on real clear-weather scenes to real foggy scenes adaptation and synthetic non-foggy images to real foggy scenes adaptation scenarios. Our experiments demonstrate that FogAdapt significantly outperforms the current state-of-the-art in semantic segmentation of foggy images. Specifically, by considering the standard settings compared to state-of-the-art (SOTA) methods, FogAdapt gains 3.8% on Foggy Zurich, 6.0% on Foggy Driving-dense, and 3.6% on Foggy Driving in mIoU when adapted from Cityscapes to Foggy Zurich.
    Audio representations for deep learning in sound synthesis: A review. (arXiv:2201.02490v1 [cs.SD])
    (2 min) The rise of deep learning algorithms has led many researchers to withdraw from using classic signal processing methods for sound generation. Deep learning models have achieved expressive voice synthesis, realistic sound textures, and musical notes from virtual instruments. However, the most suitable deep learning architecture is still under investigation. The choice of architecture is tightly coupled to the audio representations. A sound's original waveform can be too dense and rich for deep learning models to deal with efficiently - and complexity increases training time and computational cost. Also, it does not represent sound in the manner in which it is perceived. Therefore, in many cases, the raw audio has been transformed into a compressed and more meaningful form using upsampling, feature-extraction, or even by adopting a higher level illustration of the waveform. Furthermore, conditional on the form chosen, additional conditioning representations, different model architectures, and numerous metrics for evaluating the reconstructed sound have been investigated. This paper provides an overview of audio representations applied to sound synthesis using deep learning. Additionally, it presents the most significant methods for developing and evaluating a sound synthesis architecture using deep learning models, always depending on the audio representation.
    Learning Multi-Tasks with Inconsistent Labels by using Auxiliary Big Task. (arXiv:2201.02305v1 [cs.LG])
    (2 min) Multi-task learning is to improve the performance of the model by transferring and exploiting common knowledge among tasks. Existing MTL works mainly focus on the scenario where label sets among multiple tasks (MTs) are usually the same, thus they can be utilized for learning across the tasks. While almost rare works explore the scenario where each task only has a small amount of training samples, and their label sets are just partially overlapped or even not. Learning such MTs is more challenging because of less correlation information available among these tasks. For this, we propose a framework to learn these tasks by jointly leveraging both abundant information from a learnt auxiliary big task with sufficiently many classes to cover those of all these tasks and the information shared among those partially-overlapped tasks. In our implementation of using the same neural network architecture of the learnt auxiliary task to learn individual tasks, the key idea is to utilize available label information to adaptively prune the hidden layer neurons of the auxiliary network to construct corresponding network for each task, while accompanying a joint learning across individual tasks. Our experimental results demonstrate its effectiveness in comparison with the state-of-the-art approaches.
    Inferring Turbulent Parameters via Machine Learning. (arXiv:2201.00732v1 [physics.flu-dyn] CROSS LISTED)
    (2 min) We design a machine learning technique to solve the general problem of inferring physical parameters from the observation of turbulent flows, a relevant exercise in many theoretical and applied fields, from engineering to earth observation and astrophysics. Our approach is to train the machine learning system to regress the rotation frequency of the flow's reference frame, from the observation of the flow's velocity amplitude on a 2d plane extracted from the 3d domain. The machine learning approach consists of a Deep Convolutional Neural Network (DCNN) of the same kind developed in computer vision. The training and validation datasets are produced by means of fully resolved direct numerical simulations. This study shows interesting results from two different points of view. From the machine learning point of view it shows the potential of DCNN, reaching good results on such a particularly complex problem that goes well outside the limits of human vision. Second, from the physics point of view, it provides an example on how machine learning can be exploited in data analysis to infer information that would be inaccessible otherwise. Indeed, by comparing DCNN with the other possible Bayesian approaches, we find that DCNN yields to a much higher inference accuracy in all the examined cases.
    Generalized Category Discovery. (arXiv:2201.02609v1 [cs.CV])
    (2 min) In this paper, we consider a highly general image recognition setting wherein, given a labelled and unlabelled set of images, the task is to categorize all images in the unlabelled set. Here, the unlabelled images may come from labelled classes or from novel ones. Existing recognition methods are not able to deal with this setting, because they make several restrictive assumptions, such as the unlabelled instances only coming from known - or unknown - classes and the number of unknown classes being known a-priori. We address the more unconstrained setting, naming it 'Generalized Category Discovery', and challenge all these assumptions. We first establish strong baselines by taking state-of-the-art algorithms from novel category discovery and adapting them for this task. Next, we propose the use of vision transformers with contrastive representation learning for this open world setting. We then introduce a simple yet effective semi-supervised $k$-means method to cluster the unlabelled data into seen and unseen classes automatically, substantially outperforming the baselines. Finally, we also propose a new approach to estimate the number of classes in the unlabelled data. We thoroughly evaluate our approach on public datasets for generic object classification including CIFAR10, CIFAR100 and ImageNet-100, and for fine-grained visual recognition including CUB, Stanford Cars and Herbarium19, benchmarking on this new setting to foster future research.
    On robust risk-based active-learning algorithms for enhanced decision support. (arXiv:2201.02555v1 [cs.LG])
    (2 min) Classification models are a fundamental component of physical-asset management technologies such as structural health monitoring (SHM) systems and digital twins. Previous work introduced \textit{risk-based active learning}, an online approach for the development of statistical classifiers that takes into account the decision-support context in which they are applied. Decision-making is considered by preferentially querying data labels according to \textit{expected value of perfect information} (EVPI). Although several benefits are gained by adopting a risk-based active learning approach, including improved decision-making performance, the algorithms suffer from issues relating to sampling bias as a result of the guided querying process. This sampling bias ultimately manifests as a decline in decision-making performance during the later stages of active learning, which in turn corresponds to lost resource/utility. The current paper proposes two novel approaches to counteract the effects of sampling bias: \textit{semi-supervised learning}, and \textit{discriminative classification models}. These approaches are first visualised using a synthetic dataset, then subsequently applied to an experimental case study, specifically, the Z24 Bridge dataset. The semi-supervised learning approach is shown to have variable performance; with robustness to sampling bias dependent on the suitability of the generative distributions selected for the model with respect to each dataset. In contrast, the discriminative classifiers are shown to have excellent robustness to the effects of sampling bias. Moreover, it was found that the number of inspections made during a monitoring campaign, and therefore resource expenditure, could be reduced with the careful selection of the statistical classifiers used within a decision-supporting monitoring system.
  • stat.ML updates on arXiv.org

    Learning to Transfer with von Neumann Conditional Divergence. (arXiv:2108.03531v2 [cs.LG] UPDATED)
    (0 min) The similarity of feature representations plays a pivotal role in the success of problems related to domain adaptation. Feature similarity includes both the invariance of marginal distributions and the closeness of conditional distributions given the desired response $y$ (e.g., class labels). Unfortunately, traditional methods always learn such features without fully taking into consideration the information in $y$, which in turn may lead to a mismatch of the conditional distributions or the mix-up of discriminative structures underlying data distributions. In this work, we introduce the recently proposed von Neumann conditional divergence to improve the transferability across multiple domains. We show that this new divergence is differentiable and eligible to easily quantify the functional dependence between features and $y$. Given multiple source tasks, we integrate this divergence to capture discriminative information in $y$ and design novel learning objectives assuming those source tasks are observed either simultaneously or sequentially. In both scenarios, we obtain favorable performance against state-of-the-art methods in terms of smaller generalization error on new tasks and less catastrophic forgetting on source tasks (in the sequential setup).
    Unified Field Theory for Deep and Recurrent Neural Networks. (arXiv:2112.05589v2 [cond-mat.dis-nn] UPDATED)
    (2 min) Understanding capabilities and limitations of different network architectures is of fundamental importance to machine learning. Bayesian inference on Gaussian processes has proven to be a viable approach for studying recurrent and deep networks in the limit of infinite layer width, $n\to\infty$. Here we present a unified and systematic derivation of the mean-field theory for both architectures that starts from first principles by employing established methods from statistical physics of disordered systems. The theory elucidates that while the mean-field equations are different with regard to their temporal structure, they yet yield identical Gaussian kernels when readouts are taken at a single time point or layer, respectively. Bayesian inference applied to classification then predicts identical performance and capabilities for the two architectures. Numerically, we find that convergence towards the mean-field theory is typically slower for recurrent networks than for deep networks and the convergence speed depends non-trivially on the parameters of the weight prior as well as the depth or number of time steps, respectively. Our method exposes that Gaussian processes are but the lowest order of a systematic expansion in $1/n$. The formalism thus paves the way to investigate the fundamental differences between recurrent and deep architectures at finite widths $n$.
    Towards Lower Bounds on the Depth of ReLU Neural Networks. (arXiv:2105.14835v3 [cs.LG] UPDATED)
    (2 min) We contribute to a better understanding of the class of functions that is represented by a neural network with ReLU activations and a given architecture. Using techniques from mixed-integer optimization, polyhedral theory, and tropical geometry, we provide a mathematical counterbalance to the universal approximation theorems which suggest that a single hidden layer is sufficient for learning tasks. In particular, we investigate whether the class of exactly representable functions strictly increases by adding more layers (with no restrictions on size). This problem has potential impact on algorithmic and statistical aspects because of the insight it provides into the class of functions represented by neural hypothesis classes. However, to the best of our knowledge, this question has not been investigated in the neural network literature. We also present upper bounds on the sizes of neural networks required to represent functions in these neural hypothesis classes.
    Applications of Signature Methods to Market Anomaly Detection. (arXiv:2201.02441v1 [q-fin.CP])
    (2 min) Anomaly detection is the process of identifying abnormal instances or events in data sets which deviate from the norm significantly. In this study, we propose a signatures based machine learning algorithm to detect rare or unexpected items in a given data set of time series type. We present applications of signature or randomized signature as feature extractors for anomaly detection algorithms; additionally we provide an easy, representation theoretic justification for the construction of randomized signatures. Our first application is based on synthetic data and aims at distinguishing between real and fake trajectories of stock prices, which are indistinguishable by visual inspection. We also show a real life application by using transaction data from the cryptocurrency market. In this case, we are able to identify pump and dump attempts organized on social networks with F1 scores up to 88% by means of our unsupervised learning algorithm, thus achieving results that are close to the state-of-the-art in the field based on supervised learning.
    A Theoretical Framework of Almost Hyperparameter-free Hyperparameter Selection Methods for Offline Policy Evaluation. (arXiv:2201.02300v1 [stat.ML])
    (2 min) We are concerned with the problem of hyperparameter selection of offline policy evaluation (OPE). OPE is a key component of offline reinforcement learning, which is a core technology for data-driven decision optimization without environment simulators. However, the current state-of-the-art OPE methods are not hyperparameter-free, which undermines their utility in real-life applications. We address this issue by introducing a new approximate hyperparameter selection (AHS) framework for OPE, which defines a notion of optimality (called selection criteria) in a quantitative and interpretable manner without hyperparameters. We then derive four AHS methods each of which has different characteristics such as convergence rate and time complexity. Finally, we verify effectiveness and limitation of these methods with a preliminary experiment.
    Lattice-Based Methods Surpass Sum-of-Squares in Clustering. (arXiv:2112.03898v2 [cs.LG] UPDATED)
    (2 min) Clustering is a fundamental primitive in unsupervised learning which gives rise to a rich class of computationally-challenging inference tasks. In this work, we focus on the canonical task of clustering d-dimensional Gaussian mixtures with unknown (and possibly degenerate) covariance. Recent works (Ghosh et al. '20; Mao, Wein '21; Davis, Diaz, Wang '21) have established lower bounds against the class of low-degree polynomial methods and the sum-of-squares (SoS) hierarchy for recovering certain hidden structures planted in Gaussian clustering instances. Prior work on many similar inference tasks portends that such lower bounds strongly suggest the presence of an inherent statistical-to-computational gap for clustering, that is, a parameter regime where the clustering task is statistically possible but no polynomial-time algorithm succeeds. One special case of the clustering task we consider is equivalent to the problem of finding a planted hypercube vector in an otherwise random subspace. We show that, perhaps surprisingly, this particular clustering model does not exhibit a statistical-to-computational gap, even though the aforementioned low-degree and SoS lower bounds continue to apply in this case. To achieve this, we give a polynomial-time algorithm based on the Lenstra--Lenstra--Lovasz lattice basis reduction method which achieves the statistically-optimal sample complexity of d+1 samples. This result extends the class of problems whose conjectured statistical-to-computational gaps can be "closed" by "brittle" polynomial-time algorithms, highlighting the crucial but subtle role of noise in the onset of statistical-to-computational gaps.
    Causal Mediation Analysis with Hidden Confounders. (arXiv:2102.11724v3 [stat.ML] UPDATED)
    (2 min) An important problem in causal inference is to break down the total effect of a treatment on an outcome into different causal pathways and to quantify the causal effect in each pathway. For instance, in causal fairness, the total effect of being a male employee (i.e., treatment) constitutes its direct effect on annual income (i.e., outcome) and the indirect effect via the employee's occupation (i.e., mediator). Causal mediation analysis (CMA) is a formal statistical framework commonly used to reveal such underlying causal mechanisms. One major challenge of CMA in observational studies is handling confounders, variables that cause spurious causal relationships among treatment, mediator, and outcome. Conventional methods assume sequential ignorability that implies all confounders can be measured, which is often unverifiable in practice. This work aims to circumvent the stringent sequential ignorability assumptions and consider hidden confounders. Drawing upon proxy strategies and recent advances in deep learning, we propose to simultaneously uncover the latent variables that characterize hidden confounders and estimate the causal effects. Empirical evaluations using both synthetic and semi-synthetic datasets validate the effectiveness of the proposed method. We further show the potentials of our approach for causal fairness analysis.
    Nonlocal Kernel Network (NKN): a Stable and Resolution-Independent Deep Neural Network. (arXiv:2201.02217v1 [cs.LG])
    (2 min) Neural operators have recently become popular tools for designing solution maps between function spaces in the form of neural networks. Differently from classical scientific machine learning approaches that learn parameters of a known partial differential equation (PDE) for a single instance of the input parameters at a fixed resolution, neural operators approximate the solution map of a family of PDEs. Despite their success, the uses of neural operators are so far restricted to relatively shallow neural networks and confined to learning hidden governing laws. In this work, we propose a novel nonlocal neural operator, which we refer to as nonlocal kernel network (NKN), that is resolution independent, characterized by deep neural networks, and capable of handling a variety of tasks such as learning governing equations and classifying images. Our NKN stems from the interpretation of the neural network as a discrete nonlocal diffusion reaction equation that, in the limit of infinite layers, is equivalent to a parabolic nonlocal equation, whose stability is analyzed via nonlocal vector calculus. The resemblance with integral forms of neural operators allows NKNs to capture long-range dependencies in the feature space, while the continuous treatment of node-to-node interactions makes NKNs resolution independent. The resemblance with neural ODEs, reinterpreted in a nonlocal sense, and the stable network dynamics between layers allow for generalization of NKN's optimal parameters from shallow to deep networks. This fact enables the use of shallow-to-deep initialization techniques. Our tests show that NKNs outperform baseline methods in both learning governing equations and image classification tasks and generalize well to different resolutions and depths.
    GCWSNet: Generalized Consistent Weighted Sampling for Scalable and Accurate Training of Neural Networks. (arXiv:2201.02283v1 [stat.ML])
    (2 min) We develop the "generalized consistent weighted sampling" (GCWS) for hashing the "powered-GMM" (pGMM) kernel (with a tuning parameter $p$). It turns out that GCWS provides a numerically stable scheme for applying power transformation on the original data, regardless of the magnitude of $p$ and the data. The power transformation is often effective for boosting the performance, in many cases considerably so. We feed the hashed data to neural networks on a variety of public classification datasets and name our method ``GCWSNet''. Our extensive experiments show that GCWSNet often improves the classification accuracy. Furthermore, it is evident from the experiments that GCWSNet converges substantially faster. In fact, GCWS often reaches a reasonable accuracy with merely (less than) one epoch of the training process. This property is much desired because many applications, such as advertisement click-through rate (CTR) prediction models, or data streams (i.e., data seen only once), often train just one epoch. Another beneficial side effect is that the computations of the first layer of the neural networks become additions instead of multiplications because the input data become binary (and highly sparse). Empirical comparisons with (normalized) random Fourier features (NRFF) are provided. We also propose to reduce the model size of GCWSNet by count-sketch and develop the theory for analyzing the impact of using count-sketch on the accuracy of GCWS. Our analysis shows that an ``8-bit'' strategy should work well in that we can always apply an 8-bit count-sketch hashing on the output of GCWS hashing without hurting the accuracy much. There are many other ways to take advantage of GCWS when training deep neural networks. For example, one can apply GCWS on the outputs of the last layer to boost the accuracy of trained deep neural networks.
    On robust risk-based active-learning algorithms for enhanced decision support. (arXiv:2201.02555v1 [cs.LG])
    (2 min) Classification models are a fundamental component of physical-asset management technologies such as structural health monitoring (SHM) systems and digital twins. Previous work introduced \textit{risk-based active learning}, an online approach for the development of statistical classifiers that takes into account the decision-support context in which they are applied. Decision-making is considered by preferentially querying data labels according to \textit{expected value of perfect information} (EVPI). Although several benefits are gained by adopting a risk-based active learning approach, including improved decision-making performance, the algorithms suffer from issues relating to sampling bias as a result of the guided querying process. This sampling bias ultimately manifests as a decline in decision-making performance during the later stages of active learning, which in turn corresponds to lost resource/utility. The current paper proposes two novel approaches to counteract the effects of sampling bias: \textit{semi-supervised learning}, and \textit{discriminative classification models}. These approaches are first visualised using a synthetic dataset, then subsequently applied to an experimental case study, specifically, the Z24 Bridge dataset. The semi-supervised learning approach is shown to have variable performance; with robustness to sampling bias dependent on the suitability of the generative distributions selected for the model with respect to each dataset. In contrast, the discriminative classifiers are shown to have excellent robustness to the effects of sampling bias. Moreover, it was found that the number of inspections made during a monitoring campaign, and therefore resource expenditure, could be reduced with the careful selection of the statistical classifiers used within a decision-supporting monitoring system.
    A Unified Statistical Learning Model for Rankings and Scores with Application to Grant Panel Review. (arXiv:2201.02539v1 [stat.ME])
    (2 min) Rankings and scores are two common data types used by judges to express preferences and/or perceptions of quality in a collection of objects. Numerous models exist to study data of each type separately, but no unified statistical model captures both data types simultaneously without first performing data conversion. We propose the Mallows-Binomial model to close this gap, which combines a Mallows' $\phi$ ranking model with Binomial score models through shared parameters that quantify object quality, a consensus ranking, and the level of consensus between judges. We propose an efficient tree-search algorithm to calculate the exact MLE of model parameters, study statistical properties of the model both analytically and through simulation, and apply our model to real data from an instance of grant panel review that collected both scores and partial rankings. Furthermore, we demonstrate how model outputs can be used to rank objects with confidence. The proposed model is shown to sensibly combine information from both scores and rankings to quantify object quality and measure consensus with appropriate levels of statistical uncertainty.
    Linear classifier, least-squares cost function, and outliers. (arXiv:1808.09222v2 [physics.data-an] UPDATED)
    (2 min) A set of introductory notes on the subject of data classification using a linear classifier and least-squares cost function, and the negative effect of the presence of outliers on the decision boundary of the linear discriminant. We also show how a simple scaling could make the outlier less significant, thereby obtaining a much better decision boundary. We present some numerical results.
    PAC-Bayesian Matrix Completion with a Spectral Scaled Student Prior. (arXiv:2104.08191v2 [stat.ML] UPDATED)
    (2 min) We study the problem of matrix completion in this paper. A spectral scaled Student prior is exploited to favour the underlying low-rank structure of the data matrix. We provide a thorough theoretical investigation for our approach through PAC-Bayesian bounds. More precisely, our PAC-Bayesian approach enjoys a minimax-optimal oracle inequality which guarantees that our method works well under model misspecification and under general sampling distribution. Interestingly, we also provide efficient gradient-based sampling implementations for our approach by using Langevin Monte Carlo. More specifically, we show that our algorithms are significantly faster than Gibbs sampler in this problem. To illustrate the attractive features of our inference strategy, some numerical simulations are conducted and an application to image inpainting is demonstrated.
    A framework for deep learning emulation of numerical models with a case study in satellite remote sensing. (arXiv:1910.13408v2 [cs.LG] UPDATED)
    (3 min) Numerical models based on physics represent the state-of-the-art in earth system modeling and comprise our best tools for generating insights and predictions. Despite rapid growth in computational power, the perceived need for higher model resolutions overwhelms the latest-generation computers, reducing the ability of modelers to generate simulations for understanding parameter sensitivities and characterizing variability and uncertainty. Thus, surrogate models are often developed to capture the essential attributes of the full-blown numerical models. Recent successes of machine learning methods, especially deep learning, across many disciplines offer the possibility that complex nonlinear connectionist representations may be able to capture the underlying complex structures and nonlinear processes in earth systems. A difficult test for deep learning-based emulation, which refers to function approximation of numerical models, is to understand whether they can be comparable to traditional forms of surrogate models in terms of computational efficiency while simultaneously reproducing model results in a credible manner. A deep learning emulation that passes this test may be expected to perform even better than simple models with respect to capturing complex processes and spatiotemporal dependencies. Here we examine, with a case study in satellite-based remote sensing, the hypothesis that deep learning approaches can credibly represent the simulations from a surrogate model with comparable computational efficiency. Our results are encouraging in that the deep learning emulation reproduces the results with acceptable accuracy and often even faster performance. We discuss the broader implications of our results in light of the pace of improvements in high-performance implementations of deep learning as well as the growing desire for higher-resolution simulations in the earth sciences.
    AugmentedPCA: A Python Package of Supervised and Adversarial Linear Factor Models. (arXiv:2201.02547v1 [stat.ML])
    (2 min) Deep autoencoders are often extended with a supervised or adversarial loss to learn latent representations with desirable properties, such as greater predictivity of labels and outcomes or fairness with respects to a sensitive variable. Despite the ubiquity of supervised and adversarial deep latent factor models, these methods should demonstrate improvement over simpler linear approaches to be preferred in practice. This necessitates a reproducible linear analog that still adheres to an augmenting supervised or adversarial objective. We address this methodological gap by presenting methods that augment the principal component analysis (PCA) objective with either a supervised or an adversarial objective and provide analytic and reproducible solutions. We implement these methods in an open-source Python package, AugmentedPCA, that can produce excellent real-world baselines. We demonstrate the utility of these factor models on an open-source, RNA-seq cancer gene expression dataset, showing that augmenting with a supervised objective results in improved downstream classification performance, produces principal components with greater class fidelity, and facilitates identification of genes aligned with the principal axes of data variance with implications to development of specific types of cancer.
    Path classification by stochastic linear recurrent neural networks. (arXiv:2108.03090v2 [stat.ML] UPDATED)
    (2 min) We investigate the functioning of a classifying biological neural network from the perspective of statistical learning theory, modelled, in a simplified setting, as a continuous-time stochastic recurrent neural network (RNN) with identity activation function. In the purely stochastic (robust) regime, we give a generalisation error bound that holds with high probability, thus showing that the empirical risk minimiser is the best-in-class hypothesis. We show that RNNs retain a partial signature of the paths they are fed as the unique information exploited for training and classification tasks. We argue that these RNNs are easy to train and robust and back these observations with numerical experiments on both synthetic and real data. We also exhibit a trade-off phenomenon between accuracy and robustness.
    Contrastive Code Representation Learning. (arXiv:2007.04973v4 [cs.LG] UPDATED)
    (2 min) Recent work learns contextual representations of source code by reconstructing tokens from their context. For downstream semantic understanding tasks like summarizing code in English, these representations should ideally capture program functionality. However, we show that the popular reconstruction-based BERT model is sensitive to source code edits, even when the edits preserve semantics. We propose ContraCode: a contrastive pre-training task that learns code functionality, not form. ContraCode pre-trains a neural network to identify functionally similar variants of a program among many non-equivalent distractors. We scalably generate these variants using an automated source-to-source compiler as a form of data augmentation. Contrastive pre-training improves JavaScript summarization and TypeScript type inference accuracy by 2% to 13%. We also propose a new zero-shot JavaScript code clone detection dataset, showing that ContraCode is both more robust and semantically meaningful. On it, we outperform RoBERTa by 39% AUROC in an adversarial setting and up to 5% on natural code.
    Time Series Anomaly Detection for Cyber-Physical Systems via Neural System Identification and Bayesian Filtering. (arXiv:2106.07992v2 [cs.LG] UPDATED)
    (2 min) Recent advances in AIoT technologies have led to an increasing popularity of utilizing machine learning algorithms to detect operational failures for cyber-physical systems (CPS). In its basic form, an anomaly detection module monitors the sensor measurements and actuator states from the physical plant, and detects anomalies in these measurements to identify abnormal operation status. Nevertheless, building effective anomaly detection models for CPS is rather challenging as the model has to accurately detect anomalies in presence of highly complicated system dynamics and unknown amount of sensor noise. In this work, we propose a novel time series anomaly detection method called Neural System Identification and Bayesian Filtering (NSIBF) in which a specially crafted neural network architecture is posed for system identification, i.e., capturing the dynamics of CPS in a dynamical state-space model; then a Bayesian filtering algorithm is naturally applied on top of the "identified" state-space model for robust anomaly detection by tracking the uncertainty of the hidden state of the system recursively over time. We provide qualitative as well as quantitative experiments with the proposed method on a synthetic and three real-world CPS datasets, showing that NSIBF compares favorably to the state-of-the-art methods with considerable improvements on anomaly detection in CPS.
    Acoustic Neighbor Embeddings. (arXiv:2007.10329v5 [eess.AS] UPDATED)
    (2 min) This paper proposes a novel acoustic word embedding called Acoustic Neighbor Embeddings where speech or text of arbitrary length are mapped to a vector space of fixed, reduced dimensions by adapting stochastic neighbor embedding (SNE) to sequential inputs. The Euclidean distance between coordinates in the embedding space reflects the phonetic confusability between their corresponding sequences. Two encoder neural networks are trained: an acoustic encoder that accepts speech signals in the form of frame-wise subword posterior probabilities obtained from an acoustic model and a text encoder that accepts text in the form of subword transcriptions. Compared to a triplet loss criterion, the proposed method is shown to have more effective gradients for neural network training. Experimentally, it also gives more accurate results with low-dimensional embeddings when the two encoder networks are used in tandem in a word (name) recognition task, and when the text encoder network is used standalone in an approximate phonetic matching task. In particular, in an isolated name recognition task depending solely on Euclidean nearest-neighbor search between the proposed embedding vectors, the recognition accuracy is identical to that of conventional finite state transducer(FST)-based decoding using test data with up to 1 million names in the vocabulary and 40 dimensions in the embeddings.
    Generalized quantum similarity learning. (arXiv:2201.02310v1 [quant-ph])
    (2 min) The similarity between objects is significant in a broad range of areas. While similarity can be measured using off-the-shelf distance functions, they may fail to capture the inherent meaning of similarity, which tends to depend on the underlying data and task. Moreover, conventional distance functions limit the space of similarity measures to be symmetric and do not directly allow comparing objects from different spaces. We propose using quantum networks (GQSim) for learning task-dependent (a)symmetric similarity between data that need not have the same dimensionality. We analyze the properties of such similarity function analytically (for a simple case) and numerically (for a complex case) and showthat these similarity measures can extract salient features of the data. We also demonstrate that the similarity measure derived using this technique is $(\epsilon,\gamma,\tau)$-good, resulting in theoretically guaranteed performance. Finally, we conclude by applying this technique for three relevant applications - Classification, Graph Completion, Generative modeling.
    Bayesian Online Change Point Detection for Baseline Shifts. (arXiv:2201.02325v1 [stat.ML])
    (2 min) In time series data analysis, detecting change points on a real-time basis (online) is of great interest in many areas, such as finance, environmental monitoring, and medicine. One promising means to achieve this is the Bayesian online change point detection (BOCPD) algorithm, which has been successfully adopted in particular cases in which the time series of interest has a fixed baseline. However, we have found that the algorithm struggles when the baseline irreversibly shifts from its initial state. This is because with the original BOCPD algorithm, the sensitivity with which a change point can be detected is degraded if the data points are fluctuating at locations relatively far from the original baseline. In this paper, we not only extend the original BOCPD algorithm to be applicable to a time series whose baseline is constantly shifting toward unknown values but also visualize why the proposed extension works. To demonstrate the efficacy of the proposed algorithm compared to the original one, we examine these algorithms on two real-world data sets and six synthetic data sets.
    Similarities and Differences between Machine Learning and Traditional Advanced Statistical Modeling in Healthcare Analytics. (arXiv:2201.02469v1 [cs.LG])
    (2 min) Data scientists and statisticians are often at odds when determining the best approach, machine learning or statistical modeling, to solve an analytics challenge. However, machine learning and statistical modeling are more cousins than adversaries on different sides of an analysis battleground. Choosing between the two approaches or in some cases using both is based on the problem to be solved and outcomes required as well as the data available for use and circumstances of the analysis. Machine learning and statistical modeling are complementary, based on similar mathematical principles, but simply using different tools in an overall analytics knowledge base. Determining the predominant approach should be based on the problem to be solved as well as empirical evidence, such as size and completeness of the data, number of variables, assumptions or lack thereof, and expected outcomes such as predictions or causality. Good analysts and data scientists should be well versed in both techniques and their proper application, thereby using the right tool for the right project to achieve the desired results.
    Nonparametric learning for impulse control problems. (arXiv:1909.09528v3 [math.OC] UPDATED)
    (2 min) One of the fundamental assumptions in stochastic control of continuous time processes is that the dynamics of the underlying (diffusion) process is known. This is, however, usually obviously not fulfilled in practice. On the other hand, over the last decades, a rich theory for nonparametric estimation of the drift (and volatility) for continuous time processes has been developed. The aim of this paper is bringing together techniques from stochastic control with methods from statistics for stochastic processes to find a way to both learn the dynamics of the underlying process and control in a reasonable way at the same time. More precisely, we study a long-term average impulse control problem, a stochastic version of the classical Faustmann timber harvesting problem. One of the problems that immediately arises is an exploration-exploitation dilemma as is well known for problems in machine learning. We propose a way to deal with this issue by combining exploration and exploitation periods in a suitable way. Our main finding is that this construction can be based on the rates of convergence of estimators for the invariant density. Using this, we obtain that the average cumulated regret is of uniform order $O({T^{-1/3}})$.
    Optimality in Noisy Importance Sampling. (arXiv:2201.02432v1 [stat.ML])
    (2 min) In this work, we analyze the noisy importance sampling (IS), i.e., IS working with noisy evaluations of the target density. We present the general framework and derive optimal proposal densities for noisy IS estimators. The optimal proposals incorporate the information of the variance of the noisy realizations, proposing points in regions where the noise power is higher. We also compare the use of the optimal proposals with previous optimality approaches considered in a noisy IS framework.
  • Google AI Blog

    Google Research: Themes from 2021 and Beyond
    (40 min) Posted by Jeff Dean, Senior Fellow and SVP of Google Research, on behalf of the entire Google Research community Over the last several decades, I've witnessed a lot of change in the fields of machine learning (ML) and computer science. Early approaches, which often fell short, eventually gave rise to modern approaches that have been very successful. Following that long-arc pattern of progress, I think we'll see a number of exciting advances over the next several years, advances that will ultimately benefit the lives of billions of people with greater impact than ever before. In this post, I’ll highlight five areas where ML is poised to have such impact. For each, I’ll discuss related research (mostly from 2021) and the directions and progress we’ll likely see in the next few years. …
  • Data Science Central

    DSC Weekly Digest 11 January 2022: Houston, We Have Lift-off
    (5 min) If you are a long-time reader of Data Science Central, you may notice, the next time you visit, that things look a bit … different than they did before. The site is less cluttered than before, with more space than was the case less last week. There’s a basic consistency to the site that’s new,… Read More »DSC Weekly Digest 11 January 2022: Houston, We Have Lift-off The post DSC Weekly Digest 11 January 2022: Houston, We Have Lift-off appeared first on Data Science Central.
  • The Official NVIDIA Blog

    NVIDIA Named America’s Best Place to Work on Latest Glassdoor List
    (2 min) NVIDIA is America’s best place to work, according to Glassdoor’s just-issued list of best employers for 2022. Amid a global pandemic that has affected every workplace, NVIDIA was ranked No. 1 on Glassdoor’s 14th annual Best Places to Work list for large US companies. The award is based on anonymous employee feedback covering thousands of Read article > The post NVIDIA Named America’s Best Place to Work on Latest Glassdoor List appeared first on The Official NVIDIA Blog.
  • MIT News - Artificial intelligence

    Q&A: Dolapo Adedokun on computer technology, Ireland, and all that jazz
    (8 min) MIT EECS student and Mitchell Scholar hopes to play music in Dublin while working on his MS in intelligent systems.
    The promise and pitfalls of artificial intelligence explored at TEDxMIT event
    (7 min) MIT scientists discuss the future of AI with applications across many sectors, as a tool that can be both beneficial and harmful.
  • AWS Machine Learning Blog

    Industrial automation at Tyson with computer vision, AWS Panorama, and Amazon SageMaker
    (9 min) This is the first in a two-part blog series on how Tyson Foods, Inc., is utilizing Amazon SageMaker and AWS Panorama to automate industrial processes at their meat packing plants by bringing the benefits of artificial intelligence applications at the edge. In part one, we discuss an inventory counting application for packaging lines. In part […]
  • Becoming Human: Artificial Intelligence Magazine - Medium

    7 Key Questions You Need To Answer Before Implementing Ethical AI
    (4 min) According to ethical AI statistics, 90% of organizations who have adopted ethical artificial intelligence said they know about at least one…
    Google Flu Trends Explained
    (4 min) Introduction
    2022 SEO: A Robust Future!
    (5 min) SEO = Ever-Changing with a Fast-Paced Agenda! Continue reading on Becoming Human: Artificial Intelligence Magazine »
  • John D. Cook

    Random sampling to save money
    (1 min) I was stunned when my client said that a database query that I asked them to run would cost the company $100,000 per year. I had framed my question in the most natural way, not thinking that at the company’s scale it would be worth spending some time thinking about the query. Things have somewhat […] Random sampling to save money first appeared on John D. Cook.
    Greco-Latin squares and magic squares
    (2 min) Suppose you create an n × n Latin square filled with the first n letters of the Roman alphabet. This means that each letter appears exactly once in each row and in each column. You could repeat the same exercise only using the Greek alphabet. Is it possible to find two n × n Latin […] Greco-Latin squares and magic squares first appeared on John D. Cook.
    Latin squares and 3D chess
    (2 min) In a n×n Latin square, each of the numbers 1 through n appears exactly once in each row and column. For example, the 5 × 5 square below is a Latin square. If we placed a rook on each square numbered 1, the rooks would not attack each other since no two rooks would be […] Latin squares and 3D chess first appeared on John D. Cook.
    New R book
    (1 min) Five years ago I recommended the book Learning Base R. Here’s the last paragraph of my review: Now there are more books on R, and some are more approachable to non-statisticians. The most accessible one I’ve seen so far is Learning Base R by Lawrence Leemis. It gets into statistical applications of R—that is ultimately why […] New R book first appeared on John D. Cook.
  • Artificial Intelligence

    Chunkmogrify - Facial Editing AI with Masking
    (0 min) submitted by /u/cloud_weather [link] [comments]
    CMU Researchers Propose A Computer Vision-Based Approach With Data-Frugal Deep Learning To Optimize Microstructure Imaging
    (1 min) Materials processing is the process of turning raw materials into final items through a sequence of phases or “unit operations.” The activities entail a series of industrial processes, including various mechanical and chemical methods, which are often carried out in big numbers or batches. Material processing required extensive analysis and classification of complicated microstructures for quality control. For example, the proportion of lath-type bainite in various high-strength steels affects the material’s characteristics. However, recognizing bainite in microstructural images takes time and money because researchers must first employ two types of microscopy to get a closer look, then rely on their own skills to identify bainitic regions. Continue Reading | Paper submitted by /u/ai-lover [link] [comments]
    Awesome R&D content (with code!) on Computer Vision News of January 2022.
    (1 min) Many great articles about AI, Deep Learning, Computer Vision and more... HTML5 version (recommended) PDF version Dilbert on page 2. Free subscription on page 70. Enjoy! https://preview.redd.it/a0dxqnpqx3b81.jpg?width=700&format=pjpg&auto=webp&s=ef2584405c554b006bb9cc970f4b90686034d8c5 submitted by /u/Gletta [link] [comments]
    Superintelligent Utility Monster thought experiment
    (1 min) submitted by /u/HumanSeeing [link] [comments]
    Any text based games like ai dungeon
    (1 min) submitted by /u/roblox22y [link] [comments]
    Is there a similar AI which is able to make sketches out of images like the one below?
    (0 min) https://youtu.be/Peccbcj8Ibs : submitted by /u/xXLisa28Xx [link] [comments]
    Is Low-Code the Future of Programming?
    (0 min) submitted by /u/Beautiful-Credit-868 [link] [comments]
    Q-learning and Sarsa in grid environment for short-term vs long-term rewards
    (1 min) I created my custom, grid(7 by 7) environment to apply RL algorithms. I chose Q-learning and Sarsa, in particular. The grid environment consists of 3 types of terminating states: states with negative reward(-100), state with maximum reward(100) and 2 states with half reward(50). The main goal of training is for the agent to avoid states with negative rewards and to prefer long-term reward(100) over short-term half reward(50). The trained agent works weirdly when the half-rewarded state is closer to the main reward, however, if the half-rewarded states are not that close to the main reward, then both algorithms efficiently train the agent to only go to the main reward. So, from what I understand, the result is based on the position of the half-rewarded state. My hyperparameters for both Q-learning and Sarsa are the following: epsilon=1(that is gradually decayed with a linear function),gamma=0.99(I read that for the agent to learn the main reward, the gamma should be high, 0.9-0.99 approx.),alpha=0.1 Can the problem be my environment? I'm confused because both algorithms work well without the half-rewarded state. So the problem is that depending on where the half-rewarded states are located, the algorithms sometimes don't train the agent to choose the long-term reward. If someone had a similar problem, I would really appreciate it if you could share how you solved it. submitted by /u/studentani [link] [comments]
    Predictions about the future and A.I.
    (5 min) As time passes by Artificial Intelligence (A.I.) technology is becoming a bigger thing than ever today. Even though A.I. does come with its benefits like automated machines and etc. I do have a few concerns with A.I. and the negative things about it. A few things I question are firstly self-driving cars. Now I know that many people are taking advantage of this relatively new feature, however, we should consider the fact that this is a machine doing the driving. In most cases this system is reliable, but we should still account for the small number of fatalities and issues that came from it malfunctioning. In addition to that, another thing I would like to discuss is surgeries. I believe that if AI gets to the point where they are providing health care, they should do simple tasks like gr…
  • Machine Learning

    Time Series with high seasonal period [Discussion]
    (0 min) What might be the best model(s) for daily time series with annual seasonality (period = 365) ? I used auto_arima and it doesn’t support high seasonal periods. It takes forever to fit and then throws a memory error. I read a little about Fourier but not sure how to use that in Python. submitted by /u/CheeseBurgersx [link] [comments]
    [D] Interview - This Team won the Minecraft RL BASALT Challenge! (Paper Explanation & Interview with the authors)
    (1 min) https://youtu.be/a4P8v8lGFPw The MineRL BASALT challenge has no reward functions or technical descriptions of what's to be achieved. Instead, the goal of each task is given as a short natural language string, and the agent is evaluated by a team of human judges who rate both how well the goal has been fulfilled, as well as how human-like the agent behaved. In this video, I interview KAIROS, the winning team of the 2021 challenge, and discuss how they used a combination of machine learning, efficient data collection, hand engineering, and a bit of knowledge about Minecraft to beat all other teams. ​ OUTLINE: 0:00 - Introduction 4:10 - Paper Overview 11:15 - Start of Interview 17:05 - First Approach 20:30 - State Machine 26:45 - Efficient Label Collection 30:00 - Navigation Policy 38:15 - Odometry Estimation 46:00 - Pain Points & Learnings 50:40 - Live Run Commentary 58:50 - What other tasks can be solved? 1:01:55 - What made the difference? 1:07:30 - Recommendations & Conclusion 1:11:10 - Full Runs: Waterfall 1:12:40 - Full Runs: Build House 1:17:45 - Full Runs: Animal Pen 1:20:50 - Full Runs: Find Cave ​ Paper: https://arxiv.org/abs/2112.03482 Code: https://github.com/viniciusguigo/kairos_minerl_basalt Challenge Website: https://minerl.io/basalt/ ​ Paper Title: Combining Learning from Human Feedback and Knowledge Engineering to Solve Hierarchical Tasks in Minecraft submitted by /u/ykilcher [link] [comments]
    [D] Build system for machine learning?
    (1 min) Hey all, I'm setting up a new machine learning project and I know I'm going to be collecting data over time (and making model improvements). Is there a best way to set up a build system to generate models using my training script, eval on test data, all relatively automatically? Curious what other people have tried and liked (or disliked). Thanks! submitted by /u/smp2005throwaway [link] [comments]
    [D] Significance of MLM loss when pre-training Transformers for language modeling
    (1 min) What significance does the MLM loss have when I'm pre-training Transformers for language modeling from scratch or continue the training of a pre-trained model on a different dataset? Apart from initial spikes, I don't really see any significant movement in the loss curves. Most papers just evaluate on downstream tasks such as NER or NLI. Is the MLM loss really not that well interpretable? submitted by /u/optimized-adam [link] [comments]
    [D] How is confidence bound derived for LinUCB
    (2 min) Hi! I am reading the paper A Contextual-Bandit Approach to Personalized News Article Recommendation. I struggle to understand how the confidence bound is derived: In particular, I do not understand the term under the square root. Why is this a standard deviation? Also, what values does alpha take or correspond to. Would appreciate any hints/advice. I did read the paper, but could not get the interpretation for the derivation of the upper confidence bound. Thanks https://i.redd.it/2bkd1rkro2b81.gif Latex code for the picture |x_{t,a}^T\hat{\theta_a} - E[r_{t,a}|x_{t,a}]| \leq \alpha \sqrt{x_{t,a}^T(D_a^TD_a + I_d)^{-1}x_{t,a}} \\~\\ \text{For any }\delta > 0 \text{ and } x_{t,a} \in R^d \text{, where } \alpha = 1 + \sqrt{ln(2/\delta)/2} \\~\\ a_t = \arg \max_{a \ in A_t}(x_{t,a}^T\hat{\theta_a} + \alpha \sqrt{x_{t,a}^T(D_a^TD_a + I_d)^{-1}x_{t,a}}) submitted by /u/denis56 [link] [comments]
    [P] txtai 4.0 released - semantic search with SQL, content storage, object storage, reindexing and more
    (1 min) txtai 4.0 has been released with a number of new features. https://preview.redd.it/qm9tgjkum2b81.png?width=1061&format=png&auto=webp&s=22c90382d1dee780530e69635242078a26b8305c Content Storage - Content can now be stored alongside embeddings vectors. No longer required to have external data storage. Query with SQL - txtai supports both natural language queries and SQL queries embeddings.search("feel good story") SELECT id, text, score FROM txtai WHERE similar('feel good story') AND score >= 0.15 Object Storage - Store binary objects alongside embeddings vectors Reindex - Indexes can be rebuilt using stored content, no need to resend data to txtai Index Compression - Indexes can be compressed using GZ/XZ/BZ2/ZIP External Vectors - Use external vector models from an API or an external library. Centralize building vectors on GPU servers leaving index servers to be powered by more modest hardware. More information can be found in following links. GitHub Project 4.0 Release Notes What's new in txtai 4.0 Documentation Examples submitted by /u/davidmezzetti [link] [comments]
    [Project] A dataset of parse trees generated from abstracts of arXiv articles
    (1 min) (Admittedly cross-posting from r/LanguageTechnology) https://github.com/l74d/scholarly-trees I have put up some (not so few) parse trees online as a dataset. Not something as substantial as Penn Treebank, since the trees have not been human-edited. But it is still a lot more parse trees, than those from Penn, to feed into your later-stage NLP algorithms, free of charge or hassle. The current format is straight from where they were generated. Suggestions of alternative formats based on ease of use would be heavily appreciated! submitted by /u/l74d [link] [comments]
    [R] "Hot under the collar: A latent measure of interstate hostility"
    (1 min) Tl; dr: A Bayesian ideal-point model for modeling crises in international relations. Abstract: "The majority of studies on international conflict escalation use a variety of measures of hostility including the use of force, reciprocity, and the number of fatalities. The use of different measures, however, leads to different empirical results and creates difficulties when testing existing theories of interstate conflict. Furthermore, hostility measures currently used in the conflict literature are ill suited to the task of identifying consistent predictors of international conflict escalation. This article presents a new dyadic latent measure of interstate hostility, created using a Bayesian item-response theory model and conflict data from the Militarized Interstate Dispute (MID) and Phoenix political event datasets. This model (1) provides a more granular, conceptually precise, and validated measure of hostility, which incorporates the uncertainty inherent in the latent variable; and (2) solves the problem of temporal variation in event data using a varying-intercept structure and human-coded data as a benchmark against which biases in machine-coded data are corrected. In addition, this measurement model allows for the systematic evaluation of how existing measures relate to the construct of hostility. The presented model will therefore enhance the ability of researchers to understand factors affecting conflict dynamics, including escalation and de-escalation processes." Paper: https://journals.sagepub.com/doi/pdf/10.1177/0022343320962546 Non-paywalled: http://zterechshenko.com/assets/ZTerechshenko_MA.pdf submitted by /u/bikeskata [link] [comments]
    [D] large query set few shot learning
    (1 min) Is there a research field specific to image classification/ embedding learning where we have a large number of classes +2k and only few samples per class ~1-5 ? What i noticed in few shot learning papers is the query set is always very small 20-way 1-shot or something similar. Is there any paper you have seen that i can check ? Thank you! submitted by /u/flow_smith [link] [comments]
    [P] Tiny Video Search Engine Using OpenAI's CLIP
    (1 min) Tiny Video Search Engine Using OpenAI's CLIP A fun project that I did to try out OpenAI's CLIP model. In this article, I describe a tiny video search engine and indexer that will let you search through a video with descriptive "natural language" queries and find matching frames of video. All the code is included in a Google Colab Notebook. So even if you don't have your own cuda-capable GPU, you can easily run the code yourself without setting up anything on your own computer. submitted by /u/CakeStandard3577 [link] [comments]
    [P] TensorFlow Similarity now self-supervised training
    (1 min) Happy new year :) Very happy to announce that as part of the 0.15 release, TensorFlow Similarity now support self-supervised learning using STOA algorithms. To help you get started we included in the release a detailed getting started notebook that you can run in Colab. This notebook shows you how to use SimSiam self-supervised pre-training to almost double the accuracy compared to a model trained from scratch on CIFAR 10. Hope you will find this release useful to your research and experimentations :) ​ https://preview.redd.it/ulodxbq43za81.jpg?width=1600&format=pjpg&auto=webp&s=188a51095ea122573458856db9a524491743a587 submitted by /u/ebursztein [link] [comments]
    [R]:Twitter Crawl Data for Research.
    (1 min) Hello All I am looking to crawl data for academic research (most likely need to release/open-source the dataset). conclude you guys know the license? (I have already read their webpage, terms and condition), however, I don't find too many open source twitter data set, wondering if there ‏‏‎ is any hidden terms that I am not awared off? submitted by /u/hushedBobolink7 [link] [comments]
  • Neural Networks, Deep Learning and Machine Learning

    Chunkmogrify - Facial Editing AI with Masking
    (0 min) submitted by /u/cloud_weather [link] [comments]
    From 53% to 95% acc - Real vs Fake Faces Classification | Fine-tuning EfficientNet (Github in comment)
    (1 min) submitted by /u/oFlamingo [link] [comments]
    How do I make a Neural Net AI with Cheat Engine?
    (2 min) I saw somebody explain how they made an AI for Street Fighter 5 via Cheat Engine, but I am unsure how they exactly managed this. I’m making one for a different game, can someone please help me? submitted by /u/GokuKing922 [link] [comments]
  • Reinforcement Learning

    Immediate reward and Final reward
    (1 min) Hello guys I hope you are doing well. I am struggling with designing a deep reinforcement learning model, I want to make a model that works with immediate rewards and a final reward at the end of each episode. I am wondering if the final reward must have the same distribution as the immediate rewards? because when I designed an immediate reward that is in the range of [-3,-2] and a final reward in the range of [-1,0] , the agent learns to minimize the reward as the figure below. My second question is how the agent differs the final reward and the immediate one? Thank you ! ​ https://preview.redd.it/pggvt2rzk4b81.png?width=467&format=png&auto=webp&s=d90b250c2242bc9d8936c8958f3a8258fd92f70a submitted by /u/GuavaAgreeable208 [link] [comments]
    How to improve PPO on BipedalWalker-v3?
    (1 min) I implemented basic PPO on OpenAI's BipedalWalker-v3 and after tuning hyperparameters i got maximum mean of 230 over last 100 episodes. Do you know some improvements to my basic PPO implementation I can use to get better result or closer to reward of 300? submitted by /u/TheGuy839 [link] [comments]
    Is there any method to obtain the true optimal Q-function for low dimensional continuous problem (for exmaple, cartpole)?
    (1 min) submitted by /u/Rich_Beautiful [link] [comments]
    First try at Blogging! An Introduction to Concepts in Reinforcement Learning
    (1 min) Tried my hand at blogging after procrastinating a lot. Hope y'all find it interesting Introduction to Concepts in Reinforcement Learning For any corrections, criticisms or clarifications feel free to hmu :) submitted by /u/needANewPlague_ [link] [comments]
    What advantage does Actor critic methods /Deep RL give over Naive Policy gradient algorithm like Reinforce.
    (1 min) submitted by /u/aabra__ka__daabra [link] [comments]
    help understanding ppo algorithm
    (2 min) i'm having a hard time understanding the PPO algorithm (intuitively). here are the questions i have: what is the gist of calculating the advantage? i know that it's calculating the "betterness" of the state/action compared to what the network was actually expecting, but then why is the advantage formula: reward + values_of_next_state - value_of_current_state? how does this tell us how advantageous the current state/action is? if the value network is supposed to output the expected/discounted returns (i hope i am not wrong here, or is it supposed to return the reward for just the current state?) then how does it stack up against continuous obs spaces? (eg: in a car game where the input is pixels, i can sample a random state from the whole game then how is it supposed to calculate the discounted returns? it doesn't know how far the map goes further or for that matter how far it is from the start of the track). how does the gradient affect the action? meaning in a multi discrete action space how does the gradient affect only the action that i took? instead of backproping the grad to all actions? and finally this is how i'm currently calculating returns and advantages (for a continuous action space and pixel based inputs to the CNN): self.values[self.ptr] = next_value deltas = self.rewards + self.gamma * self.values[1:] - self.values[:-1] self.advantage = self.discounted_sum(deltas, self.gamma * self.lam) self.returns = self.discounted_sum(self.rewards, self.gamma) am i doing it right? submitted by /u/jedi1026 [link] [comments]
    How do I use a Baselines algorithm such as A2C or PPO, but with a custom reward function? (OpenAI Retro)
    (1 min) Hi. I used neat-python to make an AI for Pokemon Red, but it doesn't get very far. The reward function I made gives it 10 reward every time the RAM values change, as checked every 10 frames. (I made a list of what RAM values it should watch for). I did this because I wanted to try a "curiosity" reward. Since the NEAT AI isn't getting very far, I decided to try a different algorithm that is not genetic, hoping that it will perform better. I have my eyes on A2C and PPO but I cannot find a way to make a custom reward function for them. It seems that they use the environment's reward function, which seems to be only editable in Lua. Can someone give me pointers on how to implement a custom reward function for reinforcement learning that is not NEAT? I just need it to take in a list of inputs, output a list, and learn from those and the rewards it gets. I've tried to code the reward function in Lua but I was having issues, so I'd prefer it to be in Python. submitted by /u/Unsightedmetal6 [link] [comments]
    Whats the difference between feature vectors and state space reinforcement learning?
    (1 min) I encountered both words frequently. They seem to refer to same things. I am so confused. Anybody care to help clarify? Thanks. submitted by /u/Asleep_Donut1382 [link] [comments]

2022-01-10

  • Data Science Central

    Metahealth: Coming to a Doctor’s Office Near You?
    (4 min) The Metaverse is making VR doctor’s visits a reality. VR in healthcare has widespread applications from surgical training to patient consultations. Privacy challenges are a major stumbling block. The last time I visited a doctor’s office was in October, to get my MRI results for a broken leg. It was a torturous 45-minute wait in… Read More »Metahealth: Coming to a Doctor’s Office Near You? The post Metahealth: Coming to a Doctor’s Office Near You? appeared first on Data Science Central.
  • The Official NVIDIA Blog

    Leading HPC Software Company Bright Computing Joins NVIDIA
    (2 min) Bright Computing, a leader in software for managing high performance computing systems used by more than 700 organizations worldwide, is now part of NVIDIA. Companies in healthcare, financial services, manufacturing and other markets use its tool to set up and run HPC clusters, groups of servers linked by high-speed networks into a single unit. Its Read article > The post Leading HPC Software Company Bright Computing Joins NVIDIA appeared first on The Official NVIDIA Blog.
    AI Startup Speeds Up Derivative Models for Bank of Montreal
    (3 min) To make the best portfolio decisions, banks need to accurately calculate values of their trades, while factoring in uncertain external risks. This requires high-performance computing power to run complex derivatives models — which find fair prices for financial contracts — as close to real time as possible. “You don’t want to trade today on yesterday’s Read article > The post AI Startup Speeds Up Derivative Models for Bank of Montreal appeared first on The Official NVIDIA Blog.
    Cloud Control: Production Studio Taylor James Elevates Remote Workflows With NVIDIA Technology
    (3 min) WFH was likely one the most-used acronyms of the past year, with more businesses looking to enhance their employees’ remote experiences than ever. Creative production studio Taylor James found a cloud-based solution to maintain efficiency and productivity — even while working remotely — with NVIDIA RTX Virtual Workstations on AWS. With locations in New York, Read article > The post Cloud Control: Production Studio Taylor James Elevates Remote Workflows With NVIDIA Technology appeared first on The Official NVIDIA Blog.
  • MIT News - Artificial intelligence

    Physics and the machine-learning “black box”
    (6 min) In 2.C161, George Barbastathis demonstrates how mechanical engineers can use their knowledge of physical systems to keep algorithms in check and develop more accurate predictions.
  • AWS Machine Learning Blog

    Develop an automatic review image inspection service with Amazon SageMaker
    (8 min) This is a guest post by Jihye Park, a Data Scientist at MUSINSA.  MUSINSA is one of the largest online fashion platforms in South Korea, serving 8.4M customers and selling 6,000 fashion brands. Our monthly user traffic reaches 4M, and over 90% of our demographics consist of teens and young adults who are sensitive to […]
    How ReliaQuest uses Amazon SageMaker to accelerate its AI innovation by 35x
    (5 min) Cybersecurity continues to be a top concern for enterprises. Yet the constantly evolving threat landscape that they face makes it harder than ever to be confident in their cybersecurity protections. To address this, ReliaQuest built GreyMatter, an Open XDR-as-a-Service platform that brings together telemetry from any security and business solution, whether on-premises or in one or multiple clouds, to unify detection, investigation, response, and resilience. In 2021, ReliaQuest turned to AWS to help it enhance its artificial intelligence (AI) capabilities and build new features faster.
  • Becoming Human: Artificial Intelligence Magazine - Medium

    ML-Based Energy Consumption Forecasting
    (4 min) Machine Learning has diverse applications in business, but it is probably the most effective for forecasting the future demand of a…
    Gradient-Based Optimization Demistified
    (4 min) In real life, you often have to deal with things you don’t completely understand. For instance, you drive a car, not knowing how the engine…
    How to achieve data interoperability in healthcare: tips from ITRex
    (11 min) Health data is notoriously difficult to share. Due to its sensitive nature, it requires more privacy and security than any other data…
  • John D. Cook

    Computational asceticism
    (3 min) A while back I wrote about computational survivalism, being prepared to work productively in a restricted environment. The first time I ran into computational survivalism was when someone said to me “I prefer Emacs, but I use vi because I only want to use tools I can count on being installed everywhere.” I thought that […] Computational asceticism first appeared on John D. Cook.
  • Artificial Intelligence

    Weekly China AI News: Alibaba Losses Head of Self-Driving Unit; AI Creates Images from Text, and Vice Versa; Geely, Mobileye to Build Self-Driving EV for Consumer
    (1 min) submitted by /u/trcytony [link] [comments]
    Learn Machine Learning from ground up!
    (0 min) Hi! This is my attempt to demystify the minutae of Machine Learning and Deep Learning. The goal is for the information to be complete and intuitive. https://youtube.com/channel/UC6sjv3MMPEoFCioD6-Ham4w I'd love to hear feedbacks. Please feel free to DM. Also, please share and subscribe if you like the content. MachineLearning DeepLearning #MachineLearningwithHarsh #YouTube submitted by /u/mr-minion [link] [comments]
    UC Sandiego Researchers Propose A Controllable Voice Cloning Method That Allows Fine-Grained Control Over Various Style Aspects Of The Synthesized Speech For An Unseen Speaker
    (1 min) Text-to-Speech (TTS) synthesis is achieved using current voice cloning methods for a new voice. They do not, however, manipulate the expressiveness of synthesized sounds. The task of learning to synthesize the speech of an unseen speaker with the least amount of training is known as voice cloning. UC San Diego researchers propose a Controllable voice cloning method that offers fine-grained control over many style features of synthetic speech for an unseen speaker. The voice synthesis model is explicitly conditioned on a speaker encoding, pitch contour, and latent style tokens during training. Continue Reading Paper: https://arxiv.org/pdf/2102.00151.pdf submitted by /u/ai-lover [link] [comments]
    Last Week in AI - cuddly robo-dogs, self-farming farms, AI-crafted craft beer recipes, and more!
    (0 min) submitted by /u/regalalgorithm [link] [comments]
    [R] Counterfactual Memorization in Language Models: Distinguishing Rare from Common Memorization
    (1 min) A team from Google Research, University of Pennsylvania and Cornell University proposes a principled perspective to filter out common memorization for LMs, introducing "counterfactual memorization" to measure the expected change in a model’s prediction and distinguish “rare” (episodic) memorization from “common” (semantic) memorization in neural LMs. Here is a quick read: Counterfactual Memorization in Language Models: Distinguishing Rare from Common Memorization. The paper Counterfactual Memorization in Neural Language Models is on arXiv. submitted by /u/Yuqing7 [link] [comments]
    REPEATING ITS SELF
    (1 min) so u ever notice that when something is said it goes popular and it circles around social media, as a meme or a popular thing in social media is mostly fake .. ​ as if its ai intelligence repeating its self just like our thoughts, is being repeated by movies or shows or new paper we read on line most of the shit we say as already been said.. its already been done for real yo submitted by /u/toppsick [link] [comments]
    Could AI labor turn tables and spark radical changes of working hierarchies?
    (1 min) In a working environment quite often errors are accounted to the contingent/individual because it is considered to be the cheapest fix to blame one rather than rethinking an entire process. Middle-end employees - like project managers - won't have anyone to blame if bottom-end employees are all replaced by - not infallible but fair - AI . They'll find themselves working for AI and not the other way around. By breaking the the blame cycle, this "fairness" responsibility might backpropagate higher than bottom to mid-end and induce a more radical systematical change. submitted by /u/MariadocBrandybuc88 [link] [comments]
    Top 10 Speech-to-Text APIs
    (0 min) submitted by /u/tah_zem [link] [comments]
    Do you think "Intelligent" robots like sophia will become relatively more common in the future? When?
    (2 min) Do you think humanity will see a future where people don't bat an eye at the thought of robots of this kind? submitted by /u/Fantasyneli [link] [comments]
    Artificial Intelligence and Education
    (1 min) Very interesting to see this technology become more popular and spread. Friends have been using tools like hyperwrite and speedwrite to get ideas for their writing and assist with drafts, and it seems like more and more people are talking about this as the technology becomes more accessible. It will be interesting to see how this technology will impact the future of education, and curious if any students (or others) have tried writing tools like this? Just started getting really exposed to this stuff and loving it! submitted by /u/aiguy2 [link] [comments]
    Researchers From China Propose A Pale-Shaped Self-Attention (PS-Attention) And A General Vision Transformer Backbone, Called Pale Transformer
    (1 min) Transformers have recently demonstrated promising performance in a variety of visual tests. Inspired by Transformer’s success on a wide range of NLP tasks, Vision Transformer (ViT) first employed a pure Transformer architecture for image classification, demonstrating the promising performance of Transformer architecture for vision tasks. However, the quadratic complexity of global self-attention leads to high computing costs and memory use, particularly for high-resolution situations, rendering it unsuitable for use in diverse visual tasks. Various strategies confine the range of attention inside a local region to increase efficiency and lower the quadratic computing complexity generated by global self-attention. As a result, their receptive fields in a single attention layer are insufficiently big, resulting in poor context modeling. A new Pale-Shaped self-Attention (PS-Attention) method executes self-attention inside a pale-shaped zone to solve this issue. Compared to global self-attention, PS-Attention can considerably lower compute and memory expenses. Meanwhile, it can collect more fantastic contextual information while maintaining the same computational complexity as earlier local self-attention techniques. Continue Reading Paper: https://arxiv.org/pdf/2112.14000v1.pdf Github: https://github.com/BR-IDL/PaddleViT submitted by /u/ai-lover [link] [comments]
  • Machine Learning

    Dynamic Time Series Chunk Regression Problem [D]
    (1 min) Hello all, I currently have a issue and I'd like to hear some advice on how to tackle the given problem. I'm running independent trials, where for each trial a series of variables are recorded at each timestep, where each trial lasts for different periods of time. The outcome of each trial is a single value. I'd like to predict this single output value given the dynamic timesteps for time series variables. To give a visual example, here is what some data would look like: ​ https://preview.redd.it/ssz9dlb74ya81.png?width=704&format=png&auto=webp&s=da8cd49520e5570e45bdc85ff5e7cf9f7ff75ef6 Each of the first 8 columns represents a variable, where each row represents a different timestep, and each blank row separates the timesteps into chunks, where each chunk is a trial; and the last column is the single value resulting from each trial that I'd like to predict. Any advice on how to tackle this problem of predicting the last column given a dynamic amount of timesteps for a set of timeseries variables??? submitted by /u/MyActualUserName99 [link] [comments]
    [R] Model Selection in Batch Policy Optimization
    (0 min) submitted by /u/hardmaru [link] [comments]
    [D] Is there a solid (aka non euristic) reason for why smaller batch sizes lead to better generalization?
    (2 min) I mean a mathematical argument. Thanks everyone submitted by /u/alesaso2000 [link] [comments]
    [D] Why is one loss positive and the other loss negative in a multi-output branch?
    (1 min) I was reading section 3.1.D of this paper on re-id and I had a question about this part: image of relevant part Specifically, both parts are minibatch losses. So why is the 1/Nb loss negative, and why would the 1/Ns loss be positive. Intuitively, the losses are summed, so it should be (-1/Nb)+(-1/Ns) = -1/Nb - 1/Ns, right? submitted by /u/asuprem [link] [comments]
    [P] A library for visualizing (CNN) architectures and receptive field analysis
    (1 min) Hi! In my PhD, I studied the design of neural architectures and how to design them without tremendous amounts of trial and error.This lib is my most significant insight compiled into a lightweight package.Simply speaking, I figured out how to predict useless / unproductive layers in CNN architectures. This library allows you to easily visualize neural architectures from PyTorch, with unproductive layers highlighted within in the topology. This makes it possible for you to spot inefficiencies within your CNN architecture reliably, without the need for a single training step! GitHub PyPi Doc submitted by /u/KrakenInAJar [link] [comments]
    [D] Lessons From the Field in Building Your MLOps Strategy with Harpreet Sahota, Data Scientist at Comet at Enterprise Data & AI - Jan 27 @ 12 PM ET
    (1 min) Hi r/MachineLearning! I wanted to share a webinar coming up in January 2022 at Enterprise Data & AI. I put the info from the website below along with the link if you're interested. I'm really interested in the real life case studies they mention with Uber. --------- Featured Speaker: Harpreet Sahota, Data Scientist at Comet and host of "The Artists of Data Science" Podcast As machine learning expands and larger organizations begin deploying across bigger teams, the need to efficiently operationalize becomes critical for enterprises. In our discussions with leading organizations utilizing ML like The RealReal and Uber, we have compiled real-world case studies and organizational best practices for MLOps in the enterprise. Join us for a discussion where we'll explore the benefits of MLOps and discuss when and how to deploy MLOps in your ML. We'll review three real world case studies that will answer key questions: When to start implementing in MLOps? How to start implementing in MLOps? How to measure the value of your MLOps strategy? Agenda: 12:00pm-12:30: Featured Presentation 12:30-13:00pm: Your Q&A and interaction Link to the website: https://events.cognilytica.com/CLNjE3MHwyNA submitted by /u/DataGeek0 [link] [comments]
    [P] Dissecting and implementing research papers in python
    (1 min) Hi, I wrote a 2 part article on creating interaction networks between characters in novels and other bodies of text. The first part is a detailed literature review outlining and dissecting research papers relevant to the topic and the second part is the implementation of the thoughts and ideas presented in the first part (in python). Check it out if you're interested. Part 1 : https://towardsdatascience.com/mining-modelling-character-networks-part-i-e37e4878c467 Part 2 : https://towardsdatascience.com/mining-modelling-character-networks-part-ii-a3d77de89638 submitted by /u/spidermon97 [link] [comments]
    [P] Library for end-to-end neural search pipelines
    (2 min) Hello everyone ! While working on my PhD, I developed a Python package that I am pretty proud of. It allows you to easily create diverse neural search pipelines with retrievers and pre-trained language models as rankers, and it works flawlessly with middle-sized corpus. It is also currently trending #15 on HackerNews ! Check-it out ! Github link Documentation Hackernews link submitted by /u/RaphaelYt [link] [comments]
    [D] What are your favorite tools to visualize/explain tensor operations?
    (1 min) Oftentimes I encounter operations that I do not understand as a whole - which can be fixed by simply using dummy arrays and printing them out. However, that quickly becomes cumbersome. What do y'all use to visualize such complex tensor operations quickly and accurately to understand what exactly it accomplishes? submitted by /u/Competitive-Rub-1958 [link] [comments]
    [D] How to speed up inference of your Transformer-based NLP models?
    (2 min) Hello all, Many of us are having a hard time speeding up our Transformer based NLP models for inference in production. So I thought it would be worth writing an article that summarizes the options one should consider (GPU, batch inference, export to ONNX or Torchscript, using TensorRT, Deepspeed, Triton Inference Server... etc.): https://nlpcloud.io/how-to-speed-up-deep-learning-nlp-transformers-inference.html I hope you'll find it useful. If you can think of additional options, please let me know and I'll add them to the article! Julien submitted by /u/juliensalinas [link] [comments]
    [D]You're working against a tight deadline and need to deliver a project quickly. Would you consider speeding up your model training by using only a subset of the available training data?
    (4 min) View Poll submitted by /u/rrpelgrim [link] [comments]
    [D]What would happen if I removed the feed-forward layer in the transformer architecture?
    (1 min) I think it would still work albeit less efficient? submitted by /u/Sudden-Lingonberry80 [link] [comments]
  • Neural Networks, Deep Learning and Machine Learning

    How to choose optimizer and loss— Keras Python neural network
    (1 min) Hi guys, I just programmed my first neural network in Python using keras, but I’m not sure what loss functions and optimizer to use. I used mean squared error, which worked well, but I only have experience in the past using r2, so it’s not optimal. I am currently using ‘Adam’ as my optimizer, but I really don’t understand what this does. I am currently doing regression 28 input neurons leading to 1 output neuron, if that’s helpful. Thanks in advance for the help! submitted by /u/Tvdybgggh [link] [comments]
    Smart Video Generation from Text Using Deep Neural Networks
    (0 min) submitted by /u/BraveOutage [link] [comments]
    Introduction to GNN
    (1 min) Hi All, I would like to learn about Graphical Neural Network. Can you tell me a good starting paper or webpage for basic understanding and would also like to have an example project to understand the functionality submitted by /u/Shocky698 [link] [comments]
    VAE: CIFAR-10 & PyTorch - loss not improving
    (1 min) I have implemented a Variational Autoencoder using Conv-6 CNN (VGG-* family) as the encoder and decoder with CIFAR-10 in PyTorch. You can refer to the full code here. The problem is that the total loss (= reconstruction loss + KL-divergence loss) doesn't improve. Also, the log-variance is almost 0 indicating further that the multivariate Gaussians being mapped in the latent space is not happening as expected, since the log variance should have values between say -4 to +3, etc. You can see this in this code where the log variance is changing and has a non-zero value. Suggestions to alleviate the situation? submitted by /u/grid_world [link] [comments]
  • Reinforcement Learning

    Multiple Action Spaces
    (2 min) Hey guys! I´m wondering if we can apply Reinforcement Learning where an agent has two action spaces and at time t the agent has to select one of the action spaces (A or B). If it selects A there are n tasks that the agent must select one of them. The same for action space B, there are m choices that the agent must select among them. It´s a complicated problem. All I want is to know if this is possible? submitted by /u/LeatherCredit7148 [link] [comments]
    Q-learning with short-term vs long-term rewards
    (2 min) Hey guys, I have implemented and applied the Q-learning algorithm to the simple, grid environment. I defined terminating states, one with positive reward and others with negative rewards. And the training worked pretty well. Now, I wanted to enhance the process by adding the state with half reward,i.e., now there are 3 types of terminating states - a state with the biggest positive reward(100), a state with half of the reward(50), and the state with negative rewards(-100). As I said, they all terminate the process. However, when I test the trained agent, sometimes the agent goes to the state with half reward. So, the process is not as efficient as it was before. Could someone give me any tips on how to approach this problem? Is Q-learning a good approach for short-term vs long-term rewards? Thank you in advance! submitted by /u/studentani [link] [comments]
    Can you list all the US universities that you know that do research in reinforcement learning?(other than Stanford/MIT/Berkeley/UW/CMU)
    (1 min) Thank you. submitted by /u/Ok-Reaction-1515 [link] [comments]
    Question: Why does my robot arm curls up?
    (1 min) Hi, I have an issue with my algorithm which i can't seem to find where the issue lies at. I'm using RLBench for the environment with Option Critic and DDPG as the agent. I'm trying out the reach target task. I'm not sure why but when i looked at the results, the arm ends up curling/trying to return to its initial position when it tries to solve the task at hand. This happened for the open box task as well where it tries to reach the box lid to open it before it swerves and curls up the robot's base. Does anyone have any insight as to what is causing this issue? https://reddit.com/link/s09668/video/7xwaris1zra81/player https://reddit.com/link/s09668/video/44dauc99yra81/player submitted by /u/Temptedtosleep [link] [comments]
    How different are the code in unity and Isaac sim
    (1 min) I am trying to make RL algorithm testing on Unity but thinking of changing to Isaac gym, how different is the code? Do I have to change a lot? submitted by /u/baegyutae [link] [comments]

2022-01-09

  • John D. Cook

    Beatty’s theorem
    (3 min) Here’s a surprising theorem [1]. (Beatty’s theorem) Let a and b be any pair of irrational numbers greater than 1 with 1/a + 1/b = 1. Then the sequences { ⌊na⌋ } and { ⌊nb⌋ } contain every positive integer without repetition. Illustration Here’s a little Python code to play with this theorem.We set a […] Beatty’s theorem first appeared on John D. Cook.
    Corner quotes in Unicode
    (2 min) In his book Mastering Regular Expressions, Jeffrey Friedl uses corner quotes to delimit regular expressions. Here’s an example I found by opening his book a random: ⌜(\.\d\d[1-9]?)\d*⌟ The upper-left corner at the beginning and the lower-right corner at the end are not part of the regular expression. This particularly comes in handy if a regular […] Corner quotes in Unicode first appeared on John D. Cook.
  • Artificial Intelligence

    Topics for Debates on AI
    (1 min) Hello! I'm a high school computer science teacher, and teach a course on computer ethics. One of my units is on A.I. and I want to conclude the unit with student debates on topics in AI. I'm struggling to come up with topic statements however. I know for sure I want one of the topics to be centered on whether A.I. at an advanced level should be afforded the same rights as humans. Any other topic statement ideas? Thanks! submitted by /u/CT_History_Teacher [link] [comments]
    total noob here, looking for a program to creat AI art
    (1 min) hey there! i am a big fan of AI generated images and would really love to start messing with my "own" AI generated art, but it looks like things like OpenAI have a pretty rough barrier for entry, i am fairly knowledgeable with computers but i dont have any real programming/coding skills, can anyone recommend a program i can use to both train and generate AI art? or point me in the right direction? i really want something that does more than the browser apps that i can hopefully train (is that the right word?) myself on specific images submitted by /u/Chickenwomp [link] [comments]
    Apple ML Researchers Introduce ARKitScenes: A Diverse Real-World Dataset For 3D Indoor Scene Understanding Using Mobile RGB-D Data
    (1 min) Understanding indoor 3D scenes are becoming increasingly important in augmented reality, robotics, photography, games, and real estate. Many state-of-the-art scene interpretation algorithms have lately been driven by modern machine learning approaches. Depth estimation, 3D reconstruction, instance segmentation, object detection, and other methods are used to address distinct aspects of the problem. The majority of these studies are made possible by a range of real and synthetic RGB-D datasets that have been made available in recent years. Even though commercially accessible RGB-D sensors, such as Microsoft Kinect, have made the collection of such datasets possible, capturing data at a significant scale with ground truth is still a problematic issue. Continue Reading Paper: https://arxiv.org/pdf/2111.08897.pdf Github: https://github.com/apple/ARKitScenes https://preview.redd.it/nup8uxuanpa81.png?width=1920&format=png&auto=webp&s=c6405654ed77cb50c0471025963cbe77bee2ea57 submitted by /u/ai-lover [link] [comments]
    [E-Book] INSIDE ISAAC ASIMOV: QUOTES & CONTEMPLATIONS
    (1 min) Isaac Asimov published more than 500 books (Foundation Series, I. Robot, ...) during his lifetime. Asimov is best known as one of the grandmasters of science fiction. He also wrote textbooks and scientific studies that inspires many scientists and AI Researcher. As a member of Mensa International, his genius was recognized worldwide. "The saddest aspect of life right now is that science gathers knowledge faster than society gathers wisdom." ~ Isaac Asimov The book contains some of his most inspiring quotes and thoughts. Be Inspired! INSIDE ISAAC ASIMOV: QUOTES & CONTEMPLATIONS (Amazon) submitted by /u/Philo167 [link] [comments]
    Brain Efficiency: Much More than You Wanted to Know
    (0 min) https://www.lesswrong.com/posts/xwBuoE9p8GE7RAuhd/brain-efficiency-much-more-than-you-wanted-to-know submitted by /u/Singularian2501 [link] [comments]
    Is there a AI similar to this (editing images to make black sketches out of them)?
    (0 min) https://youtu.be/I4omT2L9aI8?t=759 https://www.youtube.com/watch?v=Peccbcj8Ibs&t=142s https://drawingbotv3.readthedocs.io/en/latest/quickstart.html I want to plott selfies from my girlfriend with a pen plotter. The images should look like they are drawn. submitted by /u/xXLisa28Xx [link] [comments]
    Training a zombie via reinforcement learning for a video game
    (1 min) submitted by /u/floridianfisher [link] [comments]
  • Machine Learning

    [R] RuDOLPH: One Hyper-Modal Transformer can be creative as DALL-E and smart as CLIP
    (0 min) submitted by /u/Illustrious_Row_9971 [link] [comments]
    [D] Are there any ML groups in Boston?
    (1 min) Hi, I am a graduate student and I recently moved to Boston. I was wondering if there are any ML/DL/RL or related groups in Boston that host meetups in and around the city. Thank you. ​ P.S: If you are from Boston and interested in any of these fields, then I would love to connect with you. Please send a DM or text me in reddit chat :) submitted by /u/frankhart98 [link] [comments]
    [R] Sensing Depth with 3D Computer Vision - Link to a free online lecture by the author in comments
    (2 min) submitted by /u/pinter69 [link] [comments]
    [D] Does anyone else think open source code/examples in machine learning domain usually are not as readable as they could be? Specifically use of magic numbers.
    (3 min) Admittedly, I am not an expert in machine learning or different libraries but the code I see as an example is not really beginner friendly. Even for an expert, I am not sure, they know all libraries and quircks of different datasets. Let me elaborate. The main problem I see is the use of magic numbers. For example, in below hypothetical code x = dataset[1] there is no indication of why 1 is used instead of 0 or what does it mean. May be 0th elemnt contains metadata/some useless data. Or in other cases, some axis is chosen without specifying why that is used and what are other axis to put in context. My only suggestion would be to not ever use a magic number unless it is immediately obvious. Can we not use an appropriately named constant in that case? MY_DATA_INDEX=1 x = dataset[MY_DATA_INDEX] I believe this is a very simple and helpful convention to follow. If such conventions are already there, can someone point me to then? May be people aren't just using them too often. submitted by /u/junovac [link] [comments]
    [D] The Future of Machine Learning and why it looks a lot like Julia 🤖
    (4 min) I published an article on Towards Data Science a few weeks ago about why Julia is the future of ML: https://towardsdatascience.com/the-future-of-machine-learning-and-why-it-looks-a-lot-like-julia-a0e26b51f6a6 ​ Following up on this post, I will be hosting a Twitter Spaces tomorrow (Monday). If you are interested in joining the discussion, you can find out more in this tweet: https://twitter.com/JuliaLanguage/status/1479899084261572609?s=20 submitted by /u/LoganKilpatrick1 [link] [comments]
    [D] Best tools for Multi-GPU model training?
    (2 min) Hi everyone, until recently I only had to work on problems for which a single GPU training setup would suffice. But i am working on a problem with a large dataset and I have access to multiple GPUs so I was wondering what's the best way to set this up. Going through the PyTorch documentation, they seem to suggest using torch.distributed.run along with DistributedDataParallel for distributed training although I also came across libraries in which the boilerplate stuff is abstracted out such as PyTorch Lightening or DeepSpeed. Since I'm new to this, I just wanted an opinion on the route I should choose for training my Transformer models in a distributed way. Thanks in advance! submitted by /u/Areyy_Yaar [link] [comments]
    [D] Looking for open source projects to contribute
    (3 min) Hi all, For the last 6 months I have immersed in deep learning problem domain, and have been spending a lot of time catching up on the literature, few courses and doing some personal projects as well. But now, I'm at the point where I'd like to start contributing in a more meaningful way to the community. Does anyone have idea of good open source projects related to DL (maybe even classic machine learning) that are looking for contributors? Thanks for any suggestions! submitted by /u/wreemde [link] [comments]
  • Reinforcement Learning

    What is the role of target actor and target critic?
    (2 min) Could you explain the role of target actor and target critics in DRL? Thanks submitted by /u/No_Possibility_7588 [link] [comments]
    Smoother movements for robot-based RL
    (2 min) Hello there, I am working on a RL environment where an agent moves around in a certain area. At a certain time step, the agent needs to choose the direction of movement (out of 8 choices: up down right left and diagonally). I do get the performance I am expecting in terms of reward/task completion, but the agent's movement is just not that smooth (see image below). What are my options here? I tried continuous action space (2 actions representing forces set on x and y axis), but that did not work really well. https://preview.redd.it/7ihp4t7kila81.png?width=444&format=png&auto=webp&s=585cd280e7cb1c2cf30ab7fe1cf22565eb2409b7 submitted by /u/AhmedNizam_ [link] [comments]
    representing multiple actions using gym multi-discrete
    (1 min) Hello Everyone, I am trying to implement a RL agent using using PPO with actor/critic. The agent has to move in an x-y plane by setting 2 discretized forces (2 actions) along its x and y axis. Initially I thought I need two output heads for my actor network, one for each action. However I came across this work by OpenAI, where they have a similar agent. They however use one output head for the movement action (along x y and z), where the action has a "multidiscrete" type. Any idea how this works? I have tried to understand it from the gym code but I dont get what "multidiscrete" does? Is it a way of encoding all the combinations of the different actions? Any help/insight is really appreciated submitted by /u/AhmedNizam_ [link] [comments]

2022-01-08

  • cs.LG updates on arXiv.org

    Sparsity-based Feature Selection for Anomalous Subgroup Discovery. (arXiv:2201.02008v1 [cs.LG])
    (2 min) Anomalous pattern detection aims to identify instances where deviation from normalcy is evident, and is widely applicable across domains. Multiple anomalous detection techniques have been proposed in the state of the art. However, there is a common lack of a principled and scalable feature selection method for efficient discovery. Existing feature selection techniques are often conducted by optimizing the performance of prediction outcomes rather than its systemic deviations from the expected. In this paper, we proposed a sparsity-based automated feature selection (SAFS) framework, which encodes systemic outcome deviations via the sparsity of feature-driven odds ratios. SAFS is a model-agnostic approach with usability across different discovery techniques. SAFS achieves more than $3\times$ reduction in computation time while maintaining detection performance when validated on publicly available critical care dataset. SAFS also results in a superior performance when compared against multiple baselines for feature selection.
    ExSinGAN: Learning an Explainable Generative Model from a Single Image. (arXiv:2105.07350v2 [cs.CV] UPDATED)
    (2 min) Generating images from a single sample, as a newly developing branch of image synthesis, has attracted extensive attention. In this paper, we formulate this problem as sampling from the conditional distribution of a single image, and propose a hierarchical framework that simplifies the learning of the intricate conditional distributions through the successive learning of the distributions about structure, semantics and texture, making the process of learning and generation comprehensible. On this basis, we design ExSinGAN composed of three cascaded GANs for learning an explainable generative model from a given image, where the cascaded GANs model the distributions about structure, semantics and texture successively. ExSinGAN is learned not only from the internal patches of the given image as the previous works did, but also from the external prior obtained by the GAN inversion technique. Benefiting from the appropriate combination of internal and external information, ExSinGAN has a more powerful capability of generation and competitive generalization ability for the image manipulation tasks compared with prior works.
    Formal Analysis of Art: Proxy Learning of Visual Concepts from Style Through Language Models. (arXiv:2201.01819v1 [cs.LG])
    (2 min) We present a machine learning system that can quantify fine art paintings with a set of visual elements and principles of art. This formal analysis is fundamental for understanding art, but developing such a system is challenging. Paintings have high visual complexities, but it is also difficult to collect enough training data with direct labels. To resolve these practical limitations, we introduce a novel mechanism, called proxy learning, which learns visual concepts in paintings though their general relation to styles. This framework does not require any visual annotation, but only uses style labels and a general relationship between visual concepts and style. In this paper, we propose a novel proxy model and reformulate four pre-existing methods in the context of proxy learning. Through quantitative and qualitative comparison, we evaluate these methods and compare their effectiveness in quantifying the artistic visual concepts, where the general relationship is estimated by language models; GloVe or BERT. The language modeling is a practical and scalable solution requiring no labeling, but it is inevitably imperfect. We demonstrate how the new proxy model is robust to the imperfection, while the other models are sensitively affected by it.
    Balancing Generalization and Specialization in Zero-shot Learning. (arXiv:2201.01961v1 [cs.CV])
    (2 min) Zero-Shot Learning (ZSL) aims to transfer classification capability from seen to unseen classes. Recent methods have proved that generalization and specialization are two essential abilities to achieve good performance in ZSL. However, they all focus on only one of the abilities, resulting in models that are either too general with the degraded classifying ability or too specialized to generalize to unseen classes. In this paper, we propose an end-to-end network with balanced generalization and specialization abilities, termed as BGSNet, to take advantage of both abilities, and balance them at instance- and dataset-level. Specifically, BGSNet consists of two branches: the Generalization Network (GNet), which applies episodic meta-learning to learn generalized knowledge, and the Balanced Specialization Network (BSNet), which adopts multiple attentive extractors to extract discriminative features and fulfill the instance-level balance. A novel self-adjusting diversity loss is designed to optimize BSNet with less redundancy and more diversity. We further propose a differentiable dataset-level balance and update the weights in a linear annealing schedule to simulate network pruning and thus obtain the optimal structure for BSNet at a low cost with dataset-level balance achieved. Experiments on four benchmark datasets demonstrate our model's effectiveness. Sufficient component ablations prove the necessity of integrating generalization and specialization abilities.
    Confidential Machine Learning Computation in Untrusted Environments: A Systems Security Perspective. (arXiv:2111.03308v3 [cs.CR] UPDATED)
    (2 min) As machine learning (ML) technologies and applications are rapidly changing many computing domains, security issues associated with ML are also emerging. In the domain of systems security, many endeavors have been made to ensure ML model and data confidentiality. ML computations are often inevitably performed in untrusted environments and entail complex multi-party security requirements. Hence, researchers have leveraged the Trusted Execution Environments (TEEs) to build confidential ML computation systems. We conduct a systematic and comprehensive survey by classifying attack vectors and mitigation in confidential ML computation in untrusted environments, analyzing the complex security requirements in multi-party scenarios, and summarizing engineering challenges in confidential ML implementation. Lastly, we suggest future research directions based on our study.
    Discovering contemporaneous and lagged causal relations in autocorrelated nonlinear time series datasets. (arXiv:2003.03685v2 [stat.ME] UPDATED)
    (2 min) The paper introduces a novel conditional independence (CI) based method for linear and nonlinear, lagged and contemporaneous causal discovery from observational time series in the causally sufficient case. Existing CI-based methods such as the PC algorithm and also common methods from other frameworks suffer from low recall and partially inflated false positives for strong autocorrelation which is an ubiquitous challenge in time series. The novel method, PCMCI$^+$, extends PCMCI [Runge et al., 2019b] to include discovery of contemporaneous links. PCMCI$^+$ improves the reliability of CI tests by optimizing the choice of conditioning sets and even benefits from autocorrelation. The method is order-independent and consistent in the oracle case. A broad range of numerical experiments demonstrates that PCMCI$^+$ has higher adjacency detection power and especially more contemporaneous orientation recall compared to other methods while better controlling false positives. Optimized conditioning sets also lead to much shorter runtimes than the PC algorithm. PCMCI$^+$ can be of considerable use in many real world application scenarios where often time resolutions are too coarse to resolve time delays and strong autocorrelation is present.
    A probabilistic model for fast-to-evaluate 2D crack path prediction in heterogeneous materials. (arXiv:2112.13578v2 [cs.LG] UPDATED)
    (2 min) This paper is devoted to the construction of a new fast-to-evaluate model for the prediction of 2D crack paths in concrete-like microstructures. The model generates piecewise linear cracks paths with segmentation points selected using a Markov chain model. The Markov chain kernel involves local indicators of mechanical interest and its parameters are learnt from numerical full-field 2D simulations of craking using a cohesive-volumetric finite element solver called XPER. The resulting model exhibits a drastic improvement of CPU time in comparison to simulations from XPER.
    Classification of Long Sequential Data using Circular Dilated Convolutional Neural Networks. (arXiv:2201.02143v1 [cs.LG])
    (2 min) Classification of long sequential data is an important Machine Learning task and appears in many application scenarios. Recurrent Neural Networks, Transformers, and Convolutional Neural Networks are three major techniques for learning from sequential data. Among these methods, Temporal Convolutional Networks (TCNs) which are scalable to very long sequences have achieved remarkable progress in time series regression. However, the performance of TCNs for sequence classification is not satisfactory because they use a skewed connection protocol and output classes at the last position. Such asymmetry restricts their performance for classification which depends on the whole sequence. In this work, we propose a symmetric multi-scale architecture called Circular Dilated Convolutional Neural Network (CDIL-CNN), where every position has an equal chance to receive information from other positions at the previous layers. Our model gives classification logits in all positions, and we can apply a simple ensemble learning to achieve a better decision. We have tested CDIL-CNN on various long sequential datasets. The experimental results show that our method has superior performance over many state-of-the-art approaches.
    Improving Spectral Clustering Using Spectrum-Preserving Node Reduction. (arXiv:2110.12328v2 [cs.LG] UPDATED)
    (2 min) Spectral clustering is one of the most popular clustering methods. However, the high computational cost due to the involved eigen-decomposition procedure can immediately hinder its applications in large-scale tasks. In this paper we use spectrum-preserving node reduction to accelerate eigen-decomposition and generate concise representations of data sets. Specifically, we create a small number of pseudonodes based on spectral similarity. Then, standard spectral clustering algorithm is performed on the smaller node set. Finally, each data point in the original data set is assigned to the cluster as its representative pseudo-node. The proposed framework run in nearly-linear time. Meanwhile, the clustering accuracy can be significantly improved by mining concise representations. The experimental results show dramatically improved clustering performance when compared with state-of-the-art methods.
    Jointly Efficient and Optimal Algorithms for Logistic Bandits. (arXiv:2201.01985v1 [cs.LG])
    (2 min) Logistic Bandits have recently undergone careful scrutiny by virtue of their combined theoretical and practical relevance. This research effort delivered statistically efficient algorithms, improving the regret of previous strategies by exponentially large factors. Such algorithms are however strikingly costly as they require $\Omega(t)$ operations at each round. On the other hand, a different line of research focused on computational efficiency ($\mathcal{O}(1)$ per-round cost), but at the cost of letting go of the aforementioned exponential improvements. Obtaining the best of both world is unfortunately not a matter of marrying both approaches. Instead we introduce a new learning procedure for Logistic Bandits. It yields confidence sets which sufficient statistics can be easily maintained online without sacrificing statistical tightness. Combined with efficient planning mechanisms we design fast algorithms which regret performance still match the problem-dependent lower-bound of Abeille et al. (2021). To the best of our knowledge, those are the first Logistic Bandit algorithms that simultaneously enjoy statistical and computational efficiency.
    CLLD: Contrastive Learning with Label Distance for Text Classification. (arXiv:2110.13656v3 [cs.LG] UPDATED)
    (2 min) Existed pre-trained models have achieved state-of-the-art performance on various text classification tasks. These models have proven to be useful in learning universal language representations. However, the semantic discrepancy between similar texts cannot be effectively distinguished by advanced pre-trained models, which have a great influence on the performance of hard-to-distinguish classes. To address this problem, we propose a novel Contrastive Learning with Label Distance (CLLD) in this work. Inspired by recent advances in contrastive learning, we specifically design a classification method with label distance for learning contrastive classes. CLLD ensures the flexibility within the subtle differences that lead to different label assignments, and generates the distinct representations for each class having similarity simultaneously. Extensive experiments on public benchmarks and internal datasets demonstrate that our method improves the performance of pre-trained models on classification tasks. Importantly, our experiments suggest that the learned label distance relieve the adversarial nature of interclasses.
    Efficient Global Optimization of Two-layer ReLU Networks: Quadratic-time Algorithms and Adversarial Training. (arXiv:2201.01965v1 [cs.LG])
    (2 min) The non-convexity of the artificial neural network (ANN) training landscape brings inherent optimization difficulties. While the traditional back-propagation stochastic gradient descent (SGD) algorithm and its variants are effective in certain cases, they can become stuck at spurious local minima and are sensitive to initializations and hyperparameters. Recent work has shown that the training of an ANN with ReLU activations can be reformulated as a convex program, bringing hope to globally optimizing interpretable ANNs. However, naively solving the convex training formulation has an exponential complexity, and even an approximation heuristic requires cubic time. In this work, we characterize the quality of this approximation and develop two efficient algorithms that train ANNs with global convergence guarantees. The first algorithm is based on the alternating direction method of multiplier (ADMM). It solves both the exact convex formulation and the approximate counterpart. Linear global convergence is achieved, and the initial several iterations often yield a solution with high prediction accuracy. When solving the approximate formulation, the per-iteration time complexity is quadratic. The second algorithm, based on the "sampled convex programs" theory, is simpler to implement. It solves unconstrained convex formulations and converges to an approximately globally optimal classifier. The non-convexity of the ANN training landscape exacerbates when adversarial training is considered. We apply the robust convex optimization theory to convex training and develop convex formulations that train ANNs robust to adversarial inputs. Our analysis explicitly focuses on one-hidden-layer fully connected ANNs, but can extend to more sophisticated architectures.
    Towards a theory of out-of-distribution learning. (arXiv:2109.14501v4 [stat.ML] UPDATED)
    (2 min) What is learning? 20$^{st}$ century formalizations of learning theory -- which precipitated revolutions in artificial intelligence -- focus primarily on $\mathit{in-distribution}$ learning, that is, learning under the assumption that the training data are sampled from the same distribution as the evaluation distribution. This assumption renders these theories inadequate for characterizing 21$^{st}$ century real world data problems, which are typically characterized by evaluation distributions that differ from the training data distributions (referred to as out-of-distribution learning). We therefore make a small change to existing formal definitions of learnability by relaxing that assumption. We then introduce $\mathbf{learning\ efficiency}$ (LE) to quantify the amount a learner is able to leverage data for a given problem, regardless of whether it is an in- or out-of-distribution problem. We then define and prove the relationship between generalized notions of learnability, and show how this framework is sufficiently general to characterize transfer, multitask, meta, continual, and lifelong learning. We hope this unification helps bridge the gap between empirical practice and theoretical guidance in real world problems. Finally, because biological learning continues to outperform machine learning algorithms on certain OOD challenges, we discuss the limitations of this framework vis-\'a-vis its ability to formalize biological learning, suggesting multiple avenues for future research.
    Randomized Spectral Clustering in Large-Scale Stochastic Block Models. (arXiv:2002.00839v3 [cs.SI] UPDATED)
    (2 min) Spectral clustering has been one of the widely used methods for community detection in networks. However, large-scale networks bring computational challenges to the eigenvalue decomposition therein. In this paper, we study the spectral clustering using randomized sketching algorithms from a statistical perspective, where we typically assume the network data are generated from a stochastic block model that is not necessarily of full rank. To do this, we first use the recently developed sketching algorithms to obtain two randomized spectral clustering algorithms, namely, the random projection-based and the random sampling-based spectral clustering. Then we study the theoretical bounds of the resulting algorithms in terms of the approximation error for the population adjacency matrix, the misclassification error, and the estimation error for the link probability matrix. It turns out that, under mild conditions, the randomized spectral clustering algorithms lead to the same theoretical bounds as those of the original spectral clustering algorithm. We also extend the results to degree-corrected stochastic block models. Numerical experiments support our theoretical findings and show the efficiency of randomized methods. A new R package called Rclust is developed and made available to the public.
    A note on efficient minimum cost adjustment sets in causal graphical models. (arXiv:2201.02037v1 [math.ST])
    (2 min) We study the selection of adjustment sets for estimating the interventional mean under an individualized treatment rule. We assume a non-parametric causal graphical model with, possibly, hidden variables and at least one adjustment set comprised of observable variables. Moreover, we assume that observable variables have positive costs associated with them. We define the cost of an observable adjustment set as the sum of the costs of the variables that comprise it. We show that in this setting there exist adjustment sets that are minimum cost optimal, in the sense that they yield non-parametric estimators of the interventional mean with the smallest asymptotic variance among those that control for observable adjustment sets that have minimum cost. Our results are based on the construction of a special flow network associated with the original causal graph. We show that a minimum cost optimal adjustment set can be found by computing a maximum flow on the network, and then finding the set of vertices that are reachable from the source by augmenting paths. The optimaladj Python package implements the algorithms introduced in this paper.
    AUGCO: Augmentation Consistency-guided Self-training for Source-free Domain Adaptive Semantic Segmentation. (arXiv:2107.10140v2 [cs.CV] UPDATED)
    (2 min) Most modern approaches for domain adaptive semantic segmentation rely on continued access to source data during adaptation, which may be infeasible due to computational or privacy constraints. We focus on source-free domain adaptation for semantic segmentation, wherein a source model must adapt itself to a new target domain given only unlabeled target data. We propose Augmentation Consistency-guided Self-training (AUGCO), a source-free adaptation algorithm that uses the model's pixel-level predictive consistency across diverse, automatically generated views of each target image along with model confidence to identify reliable pixel predictions, and selectively self-trains on those. AUGCO achieves state-of-the-art results for source-free adaptation on 3 standard benchmarks for semantic segmentation, all within a simple to implement and fast to converge method.
    Towards control of opinion diversity by introducing zealots into a polarised social group. (arXiv:2006.07265v7 [cs.SI] UPDATED)
    (2 min) We explore a method to influence or even control the diversity of opinions within a polarised social group. We leverage the voter model in which users hold binary opinions and repeatedly update their beliefs based on others they connect with. Stubborn agents who never change their minds ("zealots") are also disseminated through the network, which is modelled by a connected graph. Building on earlier results, we provide a closed-form expression for the average opinion of the group at equilibrium. This leads us to a strategy to inject zealots into a polarised network in order to shift the average opinion towards any target value. We account for the possible presence of a backfire effect, which may lead the group to react negatively and reinforce its level of polarisation in response. Our results are supported by numerical experiments on synthetic data.
    The dynamics of representation learning in shallow, non-linear autoencoders. (arXiv:2201.02115v1 [stat.ML])
    (2 min) Autoencoders are the simplest neural network for unsupervised learning, and thus an ideal framework for studying feature learning. While a detailed understanding of the dynamics of linear autoencoders has recently been obtained, the study of non-linear autoencoders has been hindered by the technical difficulty of handling training data with non-trivial correlations - a fundamental prerequisite for feature extraction. Here, we study the dynamics of feature learning in non-linear, shallow autoencoders. We derive a set of asymptotically exact equations that describe the generalisation dynamics of autoencoders trained with stochastic gradient descent (SGD) in the limit of high-dimensional inputs. These equations reveal that autoencoders learn the leading principal components of their inputs sequentially. An analysis of the long-time dynamics explains the failure of sigmoidal autoencoders to learn with tied weights, and highlights the importance of training the bias in ReLU autoencoders. Building on previous results for linear networks, we analyse a modification of the vanilla SGD algorithm which allows learning of the exact principal components. Finally, we show that our equations accurately describe the generalisation dynamics of non-linear autoencoders on realistic datasets such as CIFAR10.
    Revisiting Deep Subspace Alignment for Unsupervised Domain Adaptation. (arXiv:2201.01806v1 [cs.LG])
    (2 min) Unsupervised domain adaptation (UDA) aims to transfer and adapt knowledge from a labeled source domain to an unlabeled target domain. Traditionally, subspace-based methods form an important class of solutions to this problem. Despite their mathematical elegance and tractability, these methods are often found to be ineffective at producing domain-invariant features with complex, real-world datasets. Motivated by the recent advances in representation learning with deep networks, this paper revisits the use of subspace alignment for UDA and proposes a novel adaptation algorithm that consistently leads to improved generalization. In contrast to existing adversarial training-based DA methods, our approach isolates feature learning and distribution alignment steps, and utilizes a primary-auxiliary optimization strategy to effectively balance the objectives of domain invariance and model fidelity. While providing a significant reduction in target data and computational requirements, our subspace-based DA performs competitively and sometimes even outperforms state-of-the-art approaches on several standard UDA benchmarks. Furthermore, subspace alignment leads to intrinsically well-regularized models that demonstrate strong generalization even in the challenging partial DA setting. Finally, the design of our UDA framework inherently supports progressive adaptation to new target domains at test-time, without requiring retraining of the model from scratch. In summary, powered by powerful feature learners and an effective optimization strategy, we establish subspace-based DA as a highly effective approach for visual recognition.
    Gaussian Imagination in Bandit Learning. (arXiv:2201.01902v1 [cs.LG])
    (2 min) Assuming distributions are Gaussian often facilitates computations that are otherwise intractable. We consider an agent who is designed to attain a low information ratio with respect to a bandit environment with a Gaussian prior distribution and a Gaussian likelihood function, but study the agent's performance when applied instead to a Bernoulli bandit. We establish a bound on the increase in Bayesian regret when an agent interacts with the Bernoulli bandit, relative to an information-theoretic bound satisfied with the Gaussian bandit. If the Gaussian prior distribution and likelihood function are sufficiently diffuse, this increase grows with the square-root of the time horizon, and thus the per-timestep increase vanishes. Our results formalize the folklore that so-called Bayesian agents remain effective when instantiated with diffuse misspecified distributions.
    Automated Scoring of Graphical Open-Ended Responses Using Artificial Neural Networks. (arXiv:2201.01783v1 [cs.CV])
    (2 min) Automated scoring of free drawings or images as responses has yet to be utilized in large-scale assessments of student achievement. In this study, we propose artificial neural networks to classify these types of graphical responses from a computer based international mathematics and science assessment. We are comparing classification accuracy of convolutional and feedforward approaches. Our results show that convolutional neural networks (CNNs) outperform feedforward neural networks in both loss and accuracy. The CNN models classified up to 97.71% of the image responses into the appropriate scoring category, which is comparable to, if not more accurate, than typical human raters. These findings were further strengthened by the observation that the most accurate CNN models correctly classified some image responses that had been incorrectly scored by the human raters. As an additional innovation, we outline a method to select human rated responses for the training sample based on an application of the expected response function derived from item response theory. This paper argues that CNN-based automated scoring of image responses is a highly accurate procedure that could potentially replace the workload and cost of second human raters for large scale assessments, while improving the validity and comparability of scoring complex constructed-response items.
    Hidden Agenda: a Social Deduction Game with Diverse Learned Equilibria. (arXiv:2201.01816v1 [cs.AI])
    (2 min) A key challenge in the study of multiagent cooperation is the need for individual agents not only to cooperate effectively, but to decide with whom to cooperate. This is particularly critical in situations when other agents have hidden, possibly misaligned motivations and goals. Social deduction games offer an avenue to study how individuals might learn to synthesize potentially unreliable information about others, and elucidate their true motivations. In this work, we present Hidden Agenda, a two-team social deduction game that provides a 2D environment for studying learning agents in scenarios of unknown team alignment. The environment admits a rich set of strategies for both teams. Reinforcement learning agents trained in Hidden Agenda show that agents can learn a variety of behaviors, including partnering and voting without need for communication in natural language.
    POCO: Point Convolution for Surface Reconstruction. (arXiv:2201.01831v1 [cs.CV])
    (2 min) Implicit neural networks have been successfully used for surface reconstruction from point clouds. However, many of them face scalability issues as they encode the isosurface function of a whole object or scene into a single latent vector. To overcome this limitation, a few approaches infer latent vectors on a coarse regular 3D grid or on 3D patches, and interpolate them to answer occupancy queries. In doing so, they loose the direct connection with the input points sampled on the surface of objects, and they attach information uniformly in space rather than where it matters the most, i.e., near the surface. Besides, relying on fixed patch sizes may require discretization tuning. To address these issues, we propose to use point cloud convolutions and compute latent vectors at each input point. We then perform a learning-based interpolation on nearest neighbors using inferred weights. Experiments on both object and scene datasets show that our approach significantly outperforms other methods on most classical metrics, producing finer details and better reconstructing thinner volumes. The code is available at https://github.com/valeoai/POCO.
    FCNN: Five-point stencil CNN for solving reaction-diffusion equations. (arXiv:2201.01854v1 [cs.LG])
    (2 min) In this paper, we propose Five-point stencil CNN (FCNN) containing a five-point stencil kernel and a trainable approximation function. We consider reaction-diffusion type equations including heat, Fisher's, Allen-Cahn equations, and reaction-diffusion equations with trigonometric functions. Our proposed FCNN is trained well using few data and then can predict reaction-diffusion evolutions with unseen initial conditions. Also, our FCNN is trained well in the case of using noisy train data. We present various simulation results to demonstrate that our proposed FCNN is working well.
    Necessary and sufficient graphical conditions for optimal adjustment sets in causal graphical models with hidden variables. (arXiv:2102.10324v3 [cs.LG] UPDATED)
    (2 min) The problem of selecting optimal backdoor adjustment sets to estimate causal effects in graphical models with hidden and conditioned variables is addressed. Previous work has defined optimality as achieving the smallest asymptotic estimation variance and derived an optimal set for the case without hidden variables. For the case with hidden variables there can be settings where no optimal set exists and currently only a sufficient graphical optimality criterion of limited applicability has been derived. In the present work optimality is characterized as maximizing a certain adjustment information which allows to derive a necessary and sufficient graphical criterion for the existence of an optimal adjustment set and a definition and algorithm to construct it. Further, the optimal set is valid if and only if a valid adjustment set exists and has higher (or equal) adjustment information than the Adjust-set proposed in Perkovi{\'c} et al. [Journal of Machine Learning Research, 18: 1--62, 2018] for any graph. The results translate to minimal asymptotic estimation variance for a class of estimators whose asymptotic variance follows a certain information-theoretic relation. Numerical experiments indicate that the asymptotic results also hold for relatively small sample sizes and that the optimal adjustment set or minimized variants thereof often yield better variance also beyond that estimator class. Surprisingly, among the randomly created setups more than 90\% fulfill the optimality conditions indicating that also in many real-world scenarios graphical optimality may hold. Code is available as part of the python package \url{https://github.com/jakobrunge/tigramite}.
    PeCo: Perceptual Codebook for BERT Pre-training of Vision Transformers. (arXiv:2111.12710v2 [cs.CV] UPDATED)
    (2 min) This paper explores a better codebook for BERT pre-training of vision transformers. The recent work BEiT successfully transfers BERT pre-training from NLP to the vision field. It directly adopts one simple discrete VAE as the visual tokenizer, but has not considered the semantic level of the resulting visual tokens. By contrast, the discrete tokens in NLP field are naturally highly semantic. This difference motivates us to learn a perceptual codebook. And we surprisingly find one simple yet effective idea: enforcing perceptual similarity during the dVAE training. We demonstrate that the visual tokens generated by the proposed perceptual codebook do exhibit better semantic meanings, and subsequently help pre-training achieve superior transfer performance in various downstream tasks. For example, we achieve 84.5% Top-1 accuracy on ImageNet-1K with ViT-B backbone, outperforming the competitive method BEiT by +1.3 with the same pre-training epochs. It can also improve the performance of object detection and segmentation tasks on COCO val by +1.3 box AP and +1.0 mask AP, semantic segmentation on ADE20k by +1.0 mIoU. Equipped with a larger backbone ViT-H, we achieve the state-of-the-art performance (88.3% Top-1 accuracy) among the methods using only ImageNet-1K data. The code and models will be available at https://github.com/microsoft/PeCo.
    Machine Learning: Algorithms, Models, and Applications. (arXiv:2201.01943v1 [cs.LG])
    (2 min) Recent times are witnessing rapid development in machine learning algorithm systems, especially in reinforcement learning, natural language processing, computer and robot vision, image processing, speech, and emotional processing and understanding. In tune with the increasing importance and relevance of machine learning models, algorithms, and their applications, and with the emergence of more innovative uses cases of deep learning and artificial intelligence, the current volume presents a few innovative research works and their applications in real world, such as stock trading, medical and healthcare systems, and software automation. The chapters in the book illustrate how machine learning and deep learning algorithms and models are designed, optimized, and deployed. The volume will be useful for advanced graduate and doctoral students, researchers, faculty members of universities, practicing data scientists and data engineers, professionals, and consultants working on the broad areas of machine learning, deep learning, and artificial intelligence.
    A Light in the Dark: Deep Learning Practices for Industrial Computer Vision. (arXiv:2201.02028v1 [cs.CV])
    (2 min) In recent years, large pre-trained deep neural networks (DNNs) have revolutionized the field of computer vision (CV). Although these DNNs have been shown to be very well suited for general image recognition tasks, application in industry is often precluded for three reasons: 1) large pre-trained DNNs are built on hundreds of millions of parameters, making deployment on many devices impossible, 2) the underlying dataset for pre-training consists of general objects, while industrial cases often consist of very specific objects, such as structures on solar wafers, 3) potentially biased pre-trained DNNs raise legal issues for companies. As a remedy, we study neural networks for CV that we train from scratch. For this purpose, we use a real-world case from a solar wafer manufacturer. We find that our neural networks achieve similar performances as pre-trained DNNs, even though they consist of far fewer parameters and do not rely on third-party datasets.
    A Generalized Bootstrap Target for Value-Learning, Efficiently Combining Value and Feature Predictions. (arXiv:2201.01836v1 [cs.LG])
    (2 min) Estimating value functions is a core component of reinforcement learning algorithms. Temporal difference (TD) learning algorithms use bootstrapping, i.e. they update the value function toward a learning target using value estimates at subsequent time-steps. Alternatively, the value function can be updated toward a learning target constructed by separately predicting successor features (SF)--a policy-dependent model--and linearly combining them with instantaneous rewards. We focus on bootstrapping targets used when estimating value functions, and propose a new backup target, the $\eta$-return mixture, which implicitly combines value-predictive knowledge (used by TD methods) with (successor) feature-predictive knowledge--with a parameter $\eta$ capturing how much to rely on each. We illustrate that incorporating predictive knowledge through an $\eta\gamma$-discounted SF model makes more efficient use of sampled experience, compared to either extreme, i.e. bootstrapping entirely on the value function estimate, or bootstrapping on the product of separately estimated successor features and instantaneous reward models. We empirically show this approach leads to faster policy evaluation and better control performance, for tabular and nonlinear function approximations, indicating scalability and generality.
    Round and Communication Balanced Protocols for Oblivious Evaluation of Finite State Machines. (arXiv:2103.11240v2 [cs.CR] UPDATED)
    (2 min) We propose protocols for obliviously evaluating finite-state machines, i.e., the evaluation is shared between the provider of the finite-state machine and the provider of the input string in such a manner that neither party learns the other's input, and the states being visited are hidden from both. For alphabet size $|\Sigma|$, number of states $|Q|$, and input length $n$, previous solutions have either required a number of rounds linear in $n$ or communication $\Omega(n|\Sigma||Q|\log|Q|)$. Our solutions require 2 rounds with communication $O(n(|\Sigma|+|Q|\log|Q|))$. We present two different solutions to this problem, a two-party one and a setting with an untrusted but non-colluding helper.
    Frame Shift Prediction. (arXiv:2201.01837v1 [cs.CL])
    (2 min) Frame shift is a cross-linguistic phenomenon in translation which results in corresponding pairs of linguistic material evoking different frames. The ability to predict frame shifts enables automatic creation of multilingual FrameNets through annotation projection. Here, we propose the Frame Shift Prediction task and demonstrate that graph attention networks, combined with auxiliary training, can learn cross-linguistic frame-to-frame correspondence and predict frame shifts.
    Online Influence Maximization with Node-level Feedback Using Standard Offline Oracles. (arXiv:2109.06077v2 [cs.SI] UPDATED)
    (3 min) We study the online influence maximization (OIM) problem in social networks, where in multiple rounds the learner repeatedly chooses seed nodes to generate cascades, observes the cascade feedback, and gradually learns the best seeds that generate the largest cascade. We focus on two major challenges in this paper. First, we work with node-level feedback instead of edge-level feedback. The edge-level feedback reveals all edges that pass through information in a cascade, where the node-level feedback only reveals the activated nodes with timestamps. The node-level feedback is arguably more realistic since in practice it is relatively easy to observe who is influenced but very difficult to observe from which relationship (edge) the influence comes from. Second, we use standard offline oracle instead of offline pair-oracle. To compute a good seed set for the next round, an offline pair-oracle finds the best seed set and the best parameters within the confidence region simultaneously, and such an oracle is difficult to compute due to the combinatorial core of OIM problem. So we focus on how to use the standard offline influence maximization oracle which finds the best seed set given the edge parameters as input. In this paper, we resolve these challenges for the two most popular diffusion models, the independent cascade (IC) and the linear threshold (LT) model. For the IC model, the past research only achieves edge-level feedback, while we present the first $\widetilde{O}(\sqrt{T})$-regret algorithm for the node-level feedback. Besides, the algorithm only invokes standard offline oracles. For the LT model, a recent study only provides an OIM solution that meets the first challenge but still requires a pair-oracle. In this paper, we apply a similar technique as in the IC model to replace the pair-oracle with a standard oracle while maintaining $\widetilde{O}(\sqrt{T})$-regret.
    Skip Vectors for RDF Data: Extraction Based on the Complexity of Feature Patterns. (arXiv:2201.01996v1 [cs.LG])
    (2 min) The Resource Description Framework (RDF) is a framework for describing metadata, such as attributes and relationships of resources on the Web. Machine learning tasks for RDF graphs adopt three methods: (i) support vector machines (SVMs) with RDF graph kernels, (ii) RDF graph embeddings, and (iii) relational graph convolutional networks. In this paper, we propose a novel feature vector (called a Skip vector) that represents some features of each resource in an RDF graph by extracting various combinations of neighboring edges and nodes. In order to make the Skip vector low-dimensional, we select important features for classification tasks based on the information gain ratio of each feature. The classification tasks can be performed by applying the low-dimensional Skip vector of each resource to conventional machine learning algorithms, such as SVMs, the k-nearest neighbors method, neural networks, random forests, and AdaBoost. In our evaluation experiments with RDF data, such as Wikidata, DBpedia, and YAGO, we compare our method with RDF graph kernels in an SVM. We also compare our method with the two approaches: RDF graph embeddings such as RDF2vec and relational graph convolutional networks on the AIFB, MUTAG, BGS, and AM benchmarks.
    Forming Predictive Features of Tweets for Decision-Making Support. (arXiv:2201.02049v1 [cs.CL])
    (2 min) The article describes the approaches for forming different predictive features of tweet data sets and using them in the predictive analysis for decision-making support. The graph theory as well as frequent itemsets and association rules theory is used for forming and retrieving different features from these datasests. The use of these approaches makes it possible to reveal a semantic structure in tweets related to a specified entity. It is shown that quantitative characteristics of semantic frequent itemsets can be used in predictive regression models with specified target variables.
    Sentiment Analysis and Sarcasm Detection of Indian General Election Tweets. (arXiv:2201.02127v1 [cs.IR])
    (2 min) Social Media usage has increased to an all-time high level in today's digital world. The majority of the population uses social media tools (like Twitter, Facebook, YouTube, etc.) to share their thoughts and experiences with the community. Analysing the sentiments and opinions of the common public is very important for both the government and the business people. This is the reason behind the activeness of many media agencies during the election time for performing various kinds of opinion polls. In this paper, we have worked towards analysing the sentiments of the people of India during the Lok Sabha election of 2019 using the Twitter data of that duration. We have built an automatic tweet analyser using the Transfer Learning technique to handle the unsupervised nature of this problem. We have used the Linear Support Vector Classifiers method in our Machine Learning model, also, the Term Frequency Inverse Document Frequency (TF-IDF) methodology for handling the textual data of tweets. Further, we have increased the capability of the model to address the sarcastic tweets posted by some of the users, which has not been yet considered by the researchers in this domain.
    A deep learning-based model reduction (DeePMR) method for simplifying chemical kinetics. (arXiv:2201.02025v1 [cs.LG])
    (2 min) A deep learning-based model reduction (DeePMR) method for simplifying chemical kinetics is proposed and validated using high-temperature auto-ignitions, perfectly stirred reactors (PSR), and one-dimensional freely propagating flames of n-heptane/air mixtures. The mechanism reduction is modeled as an optimization problem on Boolean space, where a Boolean vector, each entry corresponding to a species, represents a reduced mechanism. The optimization goal is to minimize the reduced mechanism size given the error tolerance of a group of pre-selected benchmark quantities. The key idea of the DeePMR is to employ a deep neural network (DNN) to formulate the objective function in the optimization problem. In order to explore high dimensional Boolean space efficiently, an iterative DNN-assisted data sampling and DNN training procedure are implemented. The results show that DNN-assistance improves sampling efficiency significantly, selecting only $10^5$ samples out of $10^{34}$ possible samples for DNN to achieve sufficient accuracy. The results demonstrate the capability of the DNN to recognize key species and reasonably predict reduced mechanism performance. The well-trained DNN guarantees the optimal reduced mechanism by solving an inverse optimization problem. By comparing ignition delay times, laminar flame speeds, temperatures in PSRs, the resulting skeletal mechanism has fewer species (45 species) but the same level of accuracy as the skeletal mechanism (56 species) obtained by the Path Flux Analysis (PFA) method. In addition, the skeletal mechanism can be further reduced to 28 species if only considering atmospheric, near-stoichiometric conditions (equivalence ratio between 0.6 and 1.2). The DeePMR provides an innovative way to perform model reduction and demonstrates the great potential of data-driven methods in the combustion area.
    Machine-Learning the Classification of Spacetimes. (arXiv:2201.01644v1 [gr-qc] CROSS LISTED)
    (2 min) On the long-established classification problems in general relativity we take a novel perspective by adopting fruitful techniques from machine learning and modern data-science. In particular, we model Petrov's classification of spacetimes, and show that a feed-forward neural network can achieve high degree of success. We also show how data visualization techniques with dimensionality reduction can help analyze the underlying patterns in the structure of the different types of spacetimes.
    Bayesian Regression Approach for Building and Stacking Predictive Models in Time Series Analytics. (arXiv:2201.02034v1 [stat.AP])
    (2 min) The paper describes the use of Bayesian regression for building time series models and stacking different predictive models for time series. Using Bayesian regression for time series modeling with nonlinear trend was analyzed. This approach makes it possible to estimate an uncertainty of time series prediction and calculate value at risk characteristics. A hierarchical model for time series using Bayesian regression has been considered. In this approach, one set of parameters is the same for all data samples, other parameters can be different for different groups of data samples. Such an approach allows using this model in the case of short historical data for specified time series, e.g. in the case of new stores or new products in the sales prediction problem. In the study of predictive models stacking, the models ARIMA, Neural Network, Random Forest, Extra Tree were used for the prediction on the first level of model ensemble. On the second level, time series predictions of these models on the validation set were used for stacking by Bayesian regression. This approach gives distributions for regression coefficients of these models. It makes it possible to estimate the uncertainty contributed by each model to stacking result. The information about these distributions allows us to select an optimal set of stacking models, taking into account the domain knowledge. The probabilistic approach for stacking predictive models allows us to make risk assessment for the predictions that are important in a decision-making process.
    Deep Causal Reasoning for Recommendations. (arXiv:2201.02088v1 [cs.LG])
    (2 min) Traditional recommender systems aim to estimate a user's rating to an item based on observed ratings from the population. As with all observational studies, hidden confounders, which are factors that affect both item exposures and user ratings, lead to a systematic bias in the estimation. Consequently, a new trend in recommender system research is to negate the influence of confounders from a causal perspective. Observing that confounders in recommendations are usually shared among items and are therefore multi-cause confounders, we model the recommendation as a multi-cause multi-outcome (MCMO) inference problem. Specifically, to remedy confounding bias, we estimate user-specific latent variables that render the item exposures independent Bernoulli trials. The generative distribution is parameterized by a DNN with factorized logistic likelihood and the intractable posteriors are estimated by variational inference. Controlling these factors as substitute confounders, under mild assumptions, can eliminate the bias incurred by multi-cause confounders. Furthermore, we show that MCMO modeling may lead to high variance due to scarce observations associated with the high-dimensional causal space. Fortunately, we theoretically demonstrate that introducing user features as pre-treatment variables can substantially improve sample efficiency and alleviate overfitting. Empirical studies on simulated and real-world datasets show that the proposed deep causal recommender shows more robustness to unobserved confounders than state-of-the-art causal recommenders. Codes and datasets are released at https://github.com/yaochenzhu/deep-deconf.
    An exploratory experiment on Hindi, Bengali hate-speech detection and transfer learning using neural networks. (arXiv:2201.01997v1 [cs.CL])
    (2 min) This work presents our approach to train a neural network to detect hate-speech texts in Hindi and Bengali. We also explore how transfer learning can be applied to learning these languages, given that they have the same origin and thus, are similar to some extend. Even though the whole experiment was conducted with low computational power, the obtained result is comparable to the results of other, more expensive, models. Furthermore, since the training data in use is relatively small and the two languages are almost entirely unknown to us, this work can be generalized as an effort to demystify lost or alien languages that no human is capable of understanding.
    Knowledge Informed Machine Learning using a Weibull-based Loss Function. (arXiv:2201.01769v1 [cs.LG])
    (2 min) Machine learning can be enhanced through the integration of external knowledge. This method, called knowledge informed machine learning, is also applicable within the field of Prognostics and Health Management (PHM). In this paper, the various methods of knowledge informed machine learning, from a PHM context, are reviewed with the goal of helping the reader understand the domain. In addition, a knowledge informed machine learning technique is demonstrated, using the common IMS and PRONOSTIA bearing data sets, for remaining useful life (RUL) prediction. Specifically, knowledge is garnered from the field of reliability engineering which is represented through the Weibull distribution. The knowledge is then integrated into a neural network through a novel Weibull-based loss function. A thorough statistical analysis of the Weibull-based loss function is conducted, demonstrating the effectiveness of the method on the PRONOSTIA data set. However, the Weibull-based loss function is less effective on the IMS data set. The results, shortcomings, and benefits of the approach are discussed in length. Finally, all the code is publicly available for the benefit of other researchers.
    Sales Time Series Analytics Using Deep Q-Learning. (arXiv:2201.02058v1 [cs.LG])
    (2 min) The article describes the use of deep Q-learning models in the problems of sales time series analytics. In contrast to supervised machine learning which is a kind of passive learning using historical data, Q-learning is a kind of active learning with goal to maximize a reward by optimal sequence of actions. Model free Q-learning approach for optimal pricing strategies and supply-demand problems was considered in the work. The main idea of the study is to show that using deep Q-learning approach in time series analytics, the sequence of actions can be optimized by maximizing the reward function when the environment for learning agent interaction can be modeled using the parametric model and in the case of using the model which is based on the historical data. In the pricing optimizing case study environment was modeled using sales dependence on extras price and randomly simulated demand. In the pricing optimizing case study, the environment was modeled using sales dependence on extra price and randomly simulated demand. In the supply-demand case study, it was proposed to use historical demand time series for environment modeling, agent states were represented by promo actions, previous demand values and weekly seasonality features. Obtained results show that using deep Q-learning, we can optimize the decision making process for price optimization and supply-demand problems. Environment modeling using parametric models and historical data can be used for the cold start of learning agent. On the next steps, after the cold start, the trained agent can be used in real business environment.
    Robust Linear Predictions: Analyses of Uniform Concentration, Fast Rates and Model Misspecification. (arXiv:2201.01973v1 [stat.ML])
    (2 min) The problem of linear predictions has been extensively studied for the past century under pretty generalized frameworks. Recent advances in the robust statistics literature allow us to analyze robust versions of classical linear models through the prism of Median of Means (MoM). Combining these approaches in a piecemeal way might lead to ad-hoc procedures, and the restricted theoretical conclusions that underpin each individual contribution may no longer be valid. To meet these challenges coherently, in this study, we offer a unified robust framework that includes a broad variety of linear prediction problems on a Hilbert space, coupled with a generic class of loss functions. Notably, we do not require any assumptions on the distribution of the outlying data points ($\mathcal{O}$) nor the compactness of the support of the inlying ones ($\mathcal{I}$). Under mild conditions on the dual norm, we show that for misspecification level $\epsilon$, these estimators achieve an error rate of $O(\max\left\{|\mathcal{O}|^{1/2}n^{-1/2}, |\mathcal{I}|^{1/2}n^{-1} \right\}+\epsilon)$, matching the best-known rates in literature. This rate is slightly slower than the classical rates of $O(n^{-1/2})$, indicating that we need to pay a price in terms of error rates to obtain robust estimates. Additionally, we show that this rate can be improved to achieve so-called ``fast rates" under additional assumptions.
    Deep Learning Assisted End-to-End Synthesis of mm-Wave Passive Networks with 3D EM Structures: A Study on A Transformer-Based Matching Network. (arXiv:2201.02141v1 [cs.LG])
    (2 min) This paper presents a deep learning assisted synthesis approach for direct end-to-end generation of RF/mm-wave passive matching network with 3D EM structures. Different from prior approaches that synthesize EM structures from target circuit component values and target topologies, our proposed approach achieves the direct synthesis of the passive network given the network topology from desired performance values as input. We showcase the proposed synthesis Neural Network (NN) model on an on-chip 1:1 transformer-based impedance matching network. By leveraging parameter sharing, the synthesis NN model successfully extracts relevant features from the input impedance and load capacitors, and predict the transformer 3D EM geometry in a 45nm SOI process that will match the standard 50$\Omega$ load to the target input impedance while absorbing the two loading capacitors. As a proof-of-concept, several example transformer geometries were synthesized, and verified in Ansys HFSS to provide the desired input impedance.
    Machine Learning for Variance Reduction in Online Experiments. (arXiv:2106.07263v3 [stat.ML] UPDATED)
    (2 min) We consider the problem of variance reduction in randomized controlled trials, through the use of covariates correlated with the outcome but independent of the treatment. We propose a machine learning regression-adjusted treatment effect estimator, which we call MLRATE. MLRATE uses machine learning predictors of the outcome to reduce estimator variance. It employs cross-fitting to avoid overfitting biases, and we prove consistency and asymptotic normality under general conditions. MLRATE is robust to poor predictions from the machine learning step: if the predictions are uncorrelated with the outcomes, the estimator performs asymptotically no worse than the standard difference-in-means estimator, while if predictions are highly correlated with outcomes, the efficiency gains are large. In A/A tests, for a set of 48 outcome metrics commonly monitored in Facebook experiments the estimator has over 70% lower variance than the simple difference-in-means estimator, and about 19% lower variance than the common univariate procedure which adjusts only for pre-experiment values of the outcome.
    S2TA: Exploiting Structured Sparsity for Energy-Efficient Mobile CNN Acceleration. (arXiv:2107.07983v2 [cs.AR] UPDATED)
    (2 min) Exploiting sparsity is a key technique in accelerating quantized convolutional neural network (CNN) inference on mobile devices. Prior sparse CNN accelerators largely exploit un-structured sparsity and achieve significant speedups. Due to the unbounded, largely unpredictable sparsity patterns, however, exploiting unstructured sparsity requires complicated hardware design with significant energy and area overhead, which is particularly detrimental to mobile/IoT inference scenarios where energy and area efficiency are crucial. We propose to exploit structured sparsity, more specifically, Density Bound Block (DBB) sparsity for both weights and activations. DBB block tensors bound the maximum number of non-zeros per block. DBB thus exposes statically predictable sparsity patterns that enable lean sparsity-exploiting hardware. We propose new hardware primitives to implement DBB sparsity for (static) weights and (dynamic) activations, respectively, with very low overheads. Building on top of the primitives, we describe S2TA, a systolic array-based CNN accelerator that exploits joint weight and activation DBB sparsity and new dimensions of data reuse unavailable on the traditional systolic array. S2TA in 16nm achieves more than 2x speedup and energy reduction compared to a strong baseline of a systolic array with zero-value clock gating, over five popular CNN benchmarks. Compared to two recent non-systolic sparse accelerators, Eyeriss v2 (65nm) and SparTen (45nm), S2TA in 65nm uses about 2.2x and 3.1x less energy per inference, respectively.
    Federated Optimization of Smooth Loss Functions. (arXiv:2201.01954v1 [cs.LG])
    (2 min) In this work, we study empirical risk minimization (ERM) within a federated learning framework, where a central server minimizes an ERM objective function using training data that is stored across $m$ clients. In this setting, the Federated Averaging (FedAve) algorithm is the staple for determining $\epsilon$-approximate solutions to the ERM problem. Similar to standard optimization algorithms, the convergence analysis of FedAve only relies on smoothness of the loss function in the optimization parameter. However, loss functions are often very smooth in the training data too. To exploit this additional smoothness, we propose the Federated Low Rank Gradient Descent (FedLRGD) algorithm. Since smoothness in data induces an approximate low rank structure on the loss function, our method first performs a few rounds of communication between the server and clients to learn weights that the server can use to approximate clients' gradients. Then, our method solves the ERM problem at the server using inexact gradient descent. To show that FedLRGD can have superior performance to FedAve, we present a notion of federated oracle complexity as a counterpart to canonical oracle complexity. Under some assumptions on the loss function, e.g., strong convexity in parameter, $\eta$-H\"older smoothness in data, etc., we prove that the federated oracle complexity of FedLRGD scales like $\phi m(p/\epsilon)^{\Theta(d/\eta)}$ and that of FedAve scales like $\phi m(p/\epsilon)^{3/4}$ (neglecting sub-dominant factors), where $\phi\gg 1$ is a "communication-to-computation ratio," $p$ is the parameter dimension, and $d$ is the data dimension. Then, we show that when $d$ is small and the loss function is sufficiently smooth in the data, FedLRGD beats FedAve in federated oracle complexity. Finally, in the course of analyzing FedLRGD, we also establish a result on low rank approximation of latent variable models.
    Learning in Markov Decision Processes under Constraints. (arXiv:2002.12435v5 [cs.LG] UPDATED)
    (2 min) We consider reinforcement learning (RL) in Markov Decision Processes in which an agent repeatedly interacts with an environment that is modeled by a controlled Markov process. At each time step $t$, it earns a reward, and also incurs a cost-vector consisting of $M$ costs. We design model-based RL algorithms that maximize the cumulative reward earned over a time horizon of $T$ time-steps, while simultaneously ensuring that the average values of the $M$ cost expenditures are bounded by agent-specified thresholds $c^{ub}_i,i=1,2,\ldots,M$. In order to measure the performance of a reinforcement learning algorithm that satisfies the average cost constraints, we define an $M+1$ dimensional regret vector that is composed of its reward regret, and $M$ cost regrets. The reward regret measures the sub-optimality in the cumulative reward, while the $i$-th component of the cost regret vector is the difference between its $i$-th cumulative cost expense and the expected cost expenditures $Tc^{ub}_i$. We prove that the expected value of the regret vector of UCRL-CMDP, is upper-bounded as $\tilde{O}\left(T^{2\slash 3}\right)$, where $T$ is the time horizon. We further show how to reduce the regret of a desired subset of the $M$ costs, at the expense of increasing the regrets of rewards and the remaining costs. To the best of our knowledge, ours is the only work that considers non-episodic RL under average cost constraints, and derive algorithms that can~\emph{tune the regret vector} according to the agent's requirements on its cost regrets.
    Contrastive Neighborhood Alignment. (arXiv:2201.01922v1 [cs.LG])
    (2 min) We present Contrastive Neighborhood Alignment (CNA), a manifold learning approach to maintain the topology of learned features whereby data points that are mapped to nearby representations by the source (teacher) model are also mapped to neighbors by the target (student) model. The target model aims to mimic the local structure of the source representation space using a contrastive loss. CNA is an unsupervised learning algorithm that does not require ground-truth labels for the individual samples. CNA is illustrated in three scenarios: manifold learning, where the model maintains the local topology of the original data in a dimension-reduced space; model distillation, where a small student model is trained to mimic a larger teacher; and legacy model update, where an older model is replaced by a more powerful one. Experiments show that CNA is able to capture the manifold in a high-dimensional space and improves performance compared to the competing methods in their domains.
    GLAN: A Graph-based Linear Assignment Network. (arXiv:2201.02057v1 [cs.LG])
    (2 min) Differentiable solvers for the linear assignment problem (LAP) have attracted much research attention in recent years, which are usually embedded into learning frameworks as components. However, previous algorithms, with or without learning strategies, usually suffer from the degradation of the optimality with the increment of the problem size. In this paper, we propose a learnable linear assignment solver based on deep graph networks. Specifically, we first transform the cost matrix to a bipartite graph and convert the assignment task to the problem of selecting reliable edges from the constructed graph. Subsequently, a deep graph network is developed to aggregate and update the features of nodes and edges. Finally, the network predicts a label for each edge that indicates the assignment relationship. The experimental results on a synthetic dataset reveal that our method outperforms state-of-the-art baselines and achieves consistently high accuracy with the increment of the problem size. Furthermore, we also embed the proposed solver, in comparison with state-of-the-art baseline solvers, into a popular multi-object tracking (MOT) framework to train the tracker in an end-to-end manner. The experimental results on MOT benchmarks illustrate that the proposed LAP solver improves the tracker by the largest margin.
    BinarizedAttack: Structural Poisoning Attacks to Graph-based Anomaly Detection. (arXiv:2106.09989v5 [cs.LG] UPDATED)
    (2 min) Graph-based Anomaly Detection (GAD) is becoming prevalent due to the powerful representation abilities of graphs as well as recent advances in graph mining techniques. These GAD tools, however, expose a new attacking surface, ironically due to their unique advantage of being able to exploit the relations among data. That is, attackers now can manipulate those relations (i.e., the structure of the graph) to allow some target nodes to evade detection. In this paper, we exploit this vulnerability by designing a new type of targeted structural poisoning attacks to a representative regression-based GAD system termed OddBall. Specially, we formulate the attack against OddBall as a bi-level optimization problem, where the key technical challenge is to efficiently solve the problem in a discrete domain. We propose a novel attack method termed BinarizedAttack based on gradient descent. Comparing to prior arts, BinarizedAttack can better use the gradient information, making it particularly suitable for solving combinatorial optimization problems. Furthermore, we investigate the attack transferability of BinarizedAttack by employing it to attack other representation-learning-based GAD systems. Our comprehensive experiments demonstrate that BinarizedAttack is very effective in enabling target nodes to evade graph-based anomaly detection tools with limited attackers' budget, and in the black-box transfer attack setting, BinarizedAttack is also tested effective and in particular, can significantly change the node embeddings learned by the GAD systems. Our research thus opens the door to studying a new type of attack against security analytic tools that rely on graph data.
    Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets. (arXiv:2201.02177v1 [cs.LG])
    (2 min) In this paper we propose to study generalization of neural networks on small algorithmically generated datasets. In this setting, questions about data efficiency, memorization, generalization, and speed of learning can be studied in great detail. In some situations we show that neural networks learn through a process of "grokking" a pattern in the data, improving generalization performance from random chance level to perfect generalization, and that this improvement in generalization can happen well past the point of overfitting. We also study generalization as a function of dataset size and find that smaller datasets require increasing amounts of optimization for generalization. We argue that these datasets provide a fertile ground for studying a poorly understood aspect of deep learning: generalization of overparametrized neural networks beyond memorization of the finite training dataset.
    Jointly Learning Environments and Control Policies with Projected Stochastic Gradient Ascent. (arXiv:2006.01738v4 [cs.LG] UPDATED)
    (2 min) We consider the joint design and control of discrete-time stochastic dynamical systems over a finite time horizon. We formulate the problem as a multi-step optimization problem under uncertainty seeking to identify a system design and a control policy that jointly maximize the expected sum of rewards collected over the time horizon considered. The transition function, the reward function and the policy are all parametrized, assumed known and differentiable with respect to their parameters. We then introduce a deep reinforcement learning algorithm combining policy gradient methods with model-based optimization techniques to solve this problem. In essence, our algorithm iteratively approximates the gradient of the expected return via Monte-Carlo sampling and automatic differentiation and takes projected gradient ascent steps in the space of environment and policy parameters. This algorithm is referred to as Direct Environment and Policy Search (DEPS). We assess the performance of our algorithm in three environments concerned with the design and control of a mass-spring-damper system, a small-scale off-grid power system and a drone, respectively. In addition, our algorithm is benchmarked against a state-of-the-art deep reinforcement learning algorithm used to tackle joint design and control problems. We show that DEPS performs at least as well or better in all three environments, consistently yielding solutions with higher returns in fewer iterations. Finally, solutions produced by our algorithm are also compared with solutions produced by an algorithm that does not jointly optimize environment and policy parameters, highlighting the fact that higher returns can be achieved when joint optimization is performed.
    Representation Learning for Online and Offline RL in Low-rank MDPs. (arXiv:2110.04652v3 [cs.LG] UPDATED)
    (2 min) This work studies the question of Representation Learning in RL: how can we learn a compact low-dimensional representation such that on top of the representation we can perform RL procedures such as exploration and exploitation, in a sample efficient manner. We focus on the low-rank Markov Decision Processes (MDPs) where the transition dynamics correspond to a low-rank transition matrix. Unlike prior works that assume the representation is known (e.g., linear MDPs), here we need to learn the representation for the low-rank MDP. We study both the online RL and offline RL settings. For the online setting, operating with the same computational oracles used in FLAMBE (Agarwal et.al), the state-of-art algorithm for learning representations in low-rank MDPs, we propose an algorithm REP-UCB Upper Confidence Bound driven Representation learning for RL), which significantly improves the sample complexity from $\widetilde{O}( A^9 d^7 / (\epsilon^{10} (1-\gamma)^{22}))$ for FLAMBE to $\widetilde{O}( A^2 d^4 / (\epsilon^2 (1-\gamma)^{5}) )$ with $d$ being the rank of the transition matrix (or dimension of the ground truth representation), $A$ being the number of actions, and $\gamma$ being the discounted factor. Notably, REP-UCB is simpler than FLAMBE, as it directly balances the interplay between representation learning, exploration, and exploitation, while FLAMBE is an explore-then-commit style approach and has to perform reward-free exploration step-by-step forward in time. For the offline RL setting, we develop an algorithm that leverages pessimism to learn under a partial coverage condition: our algorithm is able to compete against any policy as long as it is covered by the offline distribution.
    Online nonnegative CP-dictionary learning for Markovian data. (arXiv:2009.07612v3 [stat.ML] UPDATED)
    (2 min) Online Tensor Factorization (OTF) is a fundamental tool in learning low-dimensional interpretable features from streaming multi-modal data. While various algorithmic and theoretical aspects of OTF have been investigated recently, a general convergence guarantee to stationary points of the objective function without any incoherence or sparsity assumptions is still lacking even for the i.i.d. case. In this work, we introduce a novel algorithm that learns a CANDECOMP/PARAFAC (CP) basis from a given stream of tensor-valued data under general constraints, including nonnegativity constraints that induce interpretability of the learned CP basis. We prove that our algorithm converges almost surely to the set of stationary points of the objective function under the hypothesis that the sequence of data tensors is generated by an underlying Markov chain. Our setting covers the classical i.i.d. case as well as a wide range of application contexts including data streams generated by independent or MCMC sampling. Our result closes a gap between OTF and Online Matrix Factorization in global convergence analysis \commHL{for CP-decompositions}. Experimentally, we show that our algorithm converges much faster than standard algorithms for nonnegative tensor factorization tasks on both synthetic and real-world data. Also, we demonstrate the utility of our algorithm on a diverse set of examples from an image, video, and time-series data, illustrating how one may learn qualitatively different CP-dictionaries from the same tensor data by exploiting the tensor structure in multiple ways.
    Earthquake Nowcasting with Deep Learning. (arXiv:2201.01869v1 [physics.geo-ph])
    (2 min) We review previous approaches to nowcasting earthquakes and introduce new approaches based on deep learning using three distinct models based on recurrent neural networks and transformers. We discuss different choices for observables and measures presenting promising initial results for a region of Southern California from 1950-2020. Earthquake activity is predicted as a function of 0.1-degree spatial bins for time periods varying from two weeks to four years. The overall quality is measured by the Nash Sutcliffe Efficiency comparing the deviation of nowcast and observation with the variance over time in each spatial region. The software is available as open-source together with the preprocessed data from the USGS.
    Fixed points of nonnegative neural networks. (arXiv:2106.16239v4 [stat.ML] UPDATED)
    (2 min) We derive conditions for the existence of fixed points of nonnegative neural networks, an important research objective to understand the behavior of neural networks in modern applications involving autoencoders and loop unrolling techniques, among others. In particular, we show that neural networks with nonnegative inputs and nonnegative parameters can be recognized as monotonic and (weakly) scalable functions within the framework of nonlinear Perron-Frobenius theory. This fact enables us to derive conditions for the existence of a nonempty fixed point set of the nonnegative neural networks, and these conditions are weaker than those obtained recently using arguments in convex analysis, which are typically based on the assumption of nonexpansivity of the activation functions. Furthermore, we prove that the shape of the fixed point set of monotonic and weakly scalable neural networks is often an interval, which degenerates to a point for the case of scalable networks. The chief results of this paper are verified in numerical simulations, where we consider an autoencoder-type network that first compresses angular power spectra in massive MIMO systems, and, second, reconstruct the input spectra from the compressed signals.
    Efficient Vertex-Oriented Polytopic Projection for Web-scale Applications. (arXiv:2103.05277v3 [cs.AI] UPDATED)
    (0 min) We consider applications involving a large set of instances of projecting points to polytopes. We develop an intuition guided by theoretical and empirical analysis to show that when these instances follow certain structures, a large majority of the projections lie on vertices of the polytopes. To do these projections efficiently we derive a vertex-oriented incremental algorithm to project a point onto any arbitrary polytope, as well as give specific algorithms to cater to simplex projection and polytopes where the unit box is cut by planes. Such settings are especially useful in web-scale applications such as optimal matching or allocation problems. Several such problems in internet marketplaces (e-commerce, ride-sharing, food delivery, professional services, advertising, etc.), can be formulated as Linear Programs (LP) with such polytope constraints that require a projection step in the overall optimization process. We show that in the very recent work, the polytopic projection is the most expensive step and our efficient projection algorithms help in gaining massive improvements in performance.
    Increasing the Confidence of Deep Neural Networks by Coverage Analysis. (arXiv:2101.12100v3 [cs.LG] UPDATED)
    (0 min) The great performance of machine learning algorithms and deep neural networks in several perception and control tasks is pushing the industry to adopt such technologies in safety-critical applications, as autonomous robots and self-driving vehicles. At present, however, several issues need to be solved to make deep learning methods more trustworthy, predictable, safe, and secure against adversarial attacks. Although several methods have been proposed to improve the trustworthiness of deep neural networks, most of them are tailored for specific classes of adversarial examples, hence failing to detect other corner cases or unsafe inputs that heavily deviate from the training samples. This paper presents a lightweight monitoring architecture based on coverage paradigms to enhance the model robustness against different unsafe inputs. In particular, four coverage analysis methods are proposed and tested in the architecture for evaluating multiple detection logics. Experimental results show that the proposed approach is effective in detecting both powerful adversarial examples and out-of-distribution inputs, introducing limited extra-execution time and memory requirements.
    Double Double Descent: On Generalization Errors in Transfer Learning between Linear Regression Tasks. (arXiv:2006.07002v7 [cs.LG] UPDATED)
    (0 min) We study the transfer learning process between two linear regression problems. An important and timely special case is when the regressors are overparameterized and perfectly interpolate their training data. We examine a parameter transfer mechanism whereby a subset of the parameters of the target task solution are constrained to the values learned for a related source task. We analytically characterize the generalization error of the target task in terms of the salient factors in the transfer learning architecture, i.e., the number of examples available, the number of (free) parameters in each of the tasks, the number of parameters transferred from the source to target task, and the relation between the two tasks. Our non-asymptotic analysis shows that the generalization error of the target task follows a two-dimensional double descent trend (with respect to the number of free parameters in each of the tasks) that is controlled by the transfer learning factors. Our analysis points to specific cases where the transfer of parameters is beneficial as a substitute for extra overparameterization (i.e., additional free parameters in the target task). Specifically, we show that the usefulness of a transfer learning setting is fragile and depends on a delicate interplay among the set of transferred parameters, the relation between the tasks, and the true solution. We also demonstrate that overparameterized transfer learning is not necessarily more beneficial when the source task is closer or identical to the target task.
    GRNN: Generative Regression Neural Network -- A Data Leakage Attack for Federated Learning. (arXiv:2105.00529v2 [cs.LG] UPDATED)
    (0 min) Data privacy has become an increasingly important issue in Machine Learning (ML), where many approaches have been developed to tackle this challenge, e.g. cryptography (Homomorphic Encryption (HE), Differential Privacy (DP), etc.) and collaborative training (Secure Multi-Party Computation (MPC), Distributed Learning and Federated Learning (FL)). These techniques have a particular focus on data encryption or secure local computation. They transfer the intermediate information to the third party to compute the final result. Gradient exchanging is commonly considered to be a secure way of training a robust model collaboratively in Deep Learning (DL). However, recent researches have demonstrated that sensitive information can be recovered from the shared gradient. Generative Adversarial Network (GAN), in particular, has shown to be effective in recovering such information. However, GAN based techniques require additional information, such as class labels which are generally unavailable for privacy-preserved learning. In this paper, we show that, in the FL system, image-based privacy data can be easily recovered in full from the shared gradient only via our proposed Generative Regression Neural Network (GRNN). We formulate the attack to be a regression problem and optimize two branches of the generative model by minimizing the distance between gradients. We evaluate our method on several image classification tasks. The results illustrate that our proposed GRNN outperforms state-of-the-art methods with better stability, stronger robustness, and higher accuracy. It also has no convergence requirement to the global FL model. Moreover, we demonstrate information leakage using face re-identification. Some defense strategies are also discussed in this work.
    Contrastive Active Inference. (arXiv:2110.10083v3 [cs.LG] UPDATED)
    (0 min) Active inference is a unifying theory for perception and action resting upon the idea that the brain maintains an internal model of the world by minimizing free energy. From a behavioral perspective, active inference agents can be seen as self-evidencing beings that act to fulfill their optimistic predictions, namely preferred outcomes or goals. In contrast, reinforcement learning requires human-designed rewards to accomplish any desired outcome. Although active inference could provide a more natural self-supervised objective for control, its applicability has been limited because of the shortcomings in scaling the approach to complex environments. In this work, we propose a contrastive objective for active inference that strongly reduces the computational burden in learning the agent's generative model and planning future actions. Our method performs notably better than likelihood-based active inference in image-based tasks, while also being computationally cheaper and easier to train. We compare to reinforcement learning agents that have access to human-designed reward functions, showing that our approach closely matches their performance. Finally, we also show that contrastive methods perform significantly better in the case of distractors in the environment and that our method is able to generalize goals to variations in the background.
    Efficient Estimation in NPIV Models: A Comparison of Various Neural Networks-Based Estimators. (arXiv:2110.06763v3 [econ.EM] UPDATED)
    (0 min) Artificial Neural Networks (ANNs) can be viewed as nonlinear sieves that can approximate complex functions of high dimensional variables more effectively than linear sieves. We investigate the computational performance of various ANNs in nonparametric instrumental variables (NPIV) models of moderately high dimensional covariates that are relevant to empirical economics. We present two efficient procedures for estimation and inference on a weighted average derivative (WAD): an orthogonalized plug-in with optimally-weighted sieve minimum distance (OP-OSMD) procedure and a sieve efficient score (ES) procedure. Both estimators for WAD use ANN sieves to approximate the unknown NPIV function and are root-n asymptotically normal and first-order equivalent. We provide a detailed practitioner's recipe for implementing both efficient procedures. This involves the choice of tuning parameters for the unknown NPIV, the conditional expectations and the optimal weighting function that are present in both procedures but also the choice of tuning parameters for the unknown Riesz representer in the ES procedure. We compare their finite-sample performances in various simulation designs that involve smooth NPIV function of up to 13 continuous covariates, different nonlinearities and covariate correlations. Some Monte Carlo findings include: 1) tuning and optimization are more delicate in ANN estimation; 2) given proper tuning, both ANN estimators with various architectures can perform well; 3) easier to tune ANN OP-OSMD estimators than ANN ES estimators; 4) stable inferences are more difficult to achieve with ANN (than spline) estimators; 5) there are gaps between current implementations and approximation theories. Finally, we apply ANN NPIV to estimate average partial derivatives in two empirical demand examples with multivariate covariates.
    CFU Playground: Full-Stack Open-Source Framework for Tiny Machine Learning (tinyML) Acceleration on FPGAs. (arXiv:2201.01863v1 [cs.LG])
    (0 min) We present CFU Playground, a full-stack open-source framework that enables rapid and iterative design of machine learning (ML) accelerators for embedded ML systems. Our toolchain tightly integrates open-source software, RTL generators, and FPGA tools for synthesis, place, and route. This full-stack development framework gives engineers access to explore bespoke architectures that are customized and co-optimized for embedded ML. The rapid, deploy-profile-optimization feedback loop lets ML hardware and software developers achieve significant returns out of a relatively small investment in customization. Using CFU Playground's design loop, we show substantial speedups (55x-75x) and design space exploration between the CPU and accelerator.
    CausalSim: Toward a Causal Data-Driven Simulator for Network Protocols. (arXiv:2201.01811v1 [cs.LG])
    (2 min) Evaluating the real-world performance of network protocols is challenging. Randomized control trials (RCT) are expensive and inaccessible to most researchers, while expert-designed simulators fail to capture complex behaviors in real networks. We present CausalSim, a data-driven simulator for network protocols that addresses this challenge. Learning network behavior from observational data is complicated due to the bias introduced by the protocols used during data collection. CausalSim uses traces from an initial RCT under a set of protocols to learn a causal network model, effectively removing the biases present in the data. Using this model, CausalSim can then simulate any protocol over the same traces (i.e., for counterfactual predictions). Key to CausalSim is the novel use of adversarial neural network training that exploits distributional invariances that are present due to the training data coming from an RCT. Our extensive evaluation of CausalSim on both real and synthetic datasets and two use cases, including more than nine months of real data from the Puffer video streaming system, shows that it provides accurate counterfactual predictions, reducing prediction error by 44% and 53% on average compared to expert-designed and standard supervised learning baselines.
    Admissible Policy Teaching through Reward Design. (arXiv:2201.02185v1 [cs.LG])
    (2 min) We study reward design strategies for incentivizing a reinforcement learning agent to adopt a policy from a set of admissible policies. The goal of the reward designer is to modify the underlying reward function cost-efficiently while ensuring that any approximately optimal deterministic policy under the new reward function is admissible and performs well under the original reward function. This problem can be viewed as a dual to the problem of optimal reward poisoning attacks: instead of forcing an agent to adopt a specific policy, the reward designer incentivizes an agent to avoid taking actions that are inadmissible in certain states. Perhaps surprisingly, and in contrast to the problem of optimal reward poisoning attacks, we first show that the reward design problem for admissible policy teaching is computationally challenging, and it is NP-hard to find an approximately optimal reward modification. We then proceed by formulating a surrogate problem whose optimal solution approximates the optimal solution to the reward design problem in our setting, but is more amenable to optimization techniques and analysis. For this surrogate problem, we present characterization results that provide bounds on the value of the optimal solution. Finally, we design a local search algorithm to solve the surrogate problem and showcase its utility using simulation-based experiments.
    Planted Dense Subgraphs in Dense Random Graphs Can Be Recovered using Graph-based Machine Learning. (arXiv:2201.01825v1 [cs.LG])
    (2 min) Multiple methods of finding the vertices belonging to a planted dense subgraph in a random dense $G(n, p)$ graph have been proposed, with an emphasis on planted cliques. Such methods can identify the planted subgraph in polynomial time, but are all limited to several subgraph structures. Here, we present PYGON, a graph neural network-based algorithm, which is insensitive to the structure of the planted subgraph. This is the first algorithm that uses advanced learning tools for recovering dense subgraphs. We show that PYGON can recover cliques of sizes $\Theta\left(\sqrt{n}\right)$, where $n$ is the size of the background graph, comparable with the state of the art. We also show that the same algorithm can recover multiple other planted subgraphs of size $\Theta\left(\sqrt{n}\right)$, in both directed and undirected graphs. We suggest a conjecture that no polynomial time PAC-learning algorithm can detect planted dense subgraphs with size smaller than $O\left(\sqrt{n}\right)$, even if in principle one could find dense subgraphs of logarithmic size.
    Graph Neural Networks for Double-Strand DNA Breaks Prediction. (arXiv:2201.01855v1 [q-bio.QM])
    (2 min) Double-strand DNA breaks (DSBs) are a form of DNA damage that can cause abnormal chromosomal rearrangements. Recent technologies based on high-throughput experiments have obvious high costs and technical challenges.Therefore, we design a graph neural network based method to predict DSBs (GraphDSB), using DNA sequence features and chromosome structure information. In order to improve the expression ability of the model, we introduce Jumping Knowledge architecture and several effective structural encoding methods. The contribution of structural information to the prediction of DSBs is verified by the experiments on datasets from normal human epidermal keratinocytes (NHEK) and chronic myeloid leukemia cell line (K562), and the ablation studies further demonstrate the effectiveness of the designed components in the proposed GraphDSB framework. Finally, we use GNNExplainer to analyze the contribution of node features and topology to DSBs prediction, and proved the high contribution of 5-mer DNA sequence features and two chromatin interaction modes.
    Elastic Product Quantization for Time Series. (arXiv:2201.01856v1 [cs.LG])
    (2 min) Analyzing numerous or long time series is difficult in practice due to the high storage costs and computational requirements. Therefore, techniques have been proposed to generate compact similarity-preserving representations of time series, enabling real-time similarity search on large in-memory data collections. However, the existing techniques are not ideally suited for assessing similarity when sequences are locally out of phase. In this paper, we propose the use of product quantization for efficient similarity-based comparison of time series under time warping. The idea is to first compress the data by partitioning the time series into equal length sub-sequences which are represented by a short code. The distance between two time series can then be efficiently approximated by pre-computed elastic distances between their codes. The partitioning into sub-sequences forces unwanted alignments, which we address with a pre-alignment step using the maximal overlap discrete wavelet transform (MODWT). To demonstrate the efficiency and accuracy of our method, we perform an extensive experimental evaluation on benchmark datasets in nearest neighbors classification and clustering applications. Overall, the proposed solution emerges as a highly efficient (both in terms of memory usage and computation time) replacement for elastic measures in time series applications.
    AIR-Net: Adaptive and Implicit Regularization Neural Network for Matrix Completion. (arXiv:2110.07557v3 [cs.LG] UPDATED)
    (2 min) The explicit low-rank regularization, e.g., nuclear norm regularization, has been widely used in imaging sciences. However, it has been found that implicit regularization outperforms explicit ones in various image processing tasks. Another issue is that the fixed explicit regularization limits the applicability to broad kinds of images since different images favor different features captured by using different explicit regularizations. As such, this paper proposes a new adaptive and implicit low-rank regularization that captures the low-rank prior dynamically from the training data. At the core of our new adaptive and implicit low-rank regularization is parameterizing the Laplacian matrix in the Dirichlet energy-based regularization with a neural network, and we call the proposed model \textit{AIR-Net}. Theoretically, we show that the adaptive regularization of AIR-Net enhances the implicit regularization and vanishes at the end of training. We validate AIR-Net's effectiveness on various benchmark tasks, indicating that the AIR-Net is particularly favorable for the scenarios when the missing entries are non-uniform. The code can be found at \href{https://github.com/lizhemin15/AIR-Net}{https://github.com/lizhemin15/AIR-Net}.
    Evaluating and Mitigating Bias in Image Classifiers: A Causal Perspective Using Counterfactuals. (arXiv:2009.08270v4 [cs.CV] UPDATED)
    (2 min) Counterfactual examples for an input -- perturbations that change specific features but not others -- have been shown to be useful for evaluating bias of machine learning models, e.g., against specific demographic groups. However, generating counterfactual examples for images is non-trivial due to the underlying causal structure on the various features of an image. To be meaningful, generated perturbations need to satisfy constraints implied by the causal model. We present a method for generating counterfactuals by incorporating a structural causal model (SCM) in an improved variant of Adversarially Learned Inference (ALI), that generates counterfactuals in accordance with the causal relationships between attributes of an image. Based on the generated counterfactuals, we show how to explain a pre-trained machine learning classifier, evaluate its bias, and mitigate the bias using a counterfactual regularizer. On the Morpho-MNIST dataset, our method generates counterfactuals comparable in quality to prior work on SCM-based counterfactuals (DeepSCM), while on the more complex CelebA dataset our method outperforms DeepSCM in generating high-quality valid counterfactuals. Moreover, generated counterfactuals are indistinguishable from reconstructed images in a human evaluation experiment and we subsequently use them to evaluate the fairness of a standard classifier trained on CelebA data. We show that the classifier is biased w.r.t. skin and hair color, and how counterfactual regularization can remove those biases.
    Towards Conditional Generation of Minimal Action Potential Pathways for Molecular Dynamics. (arXiv:2111.14053v2 [q-bio.BM] UPDATED)
    (0 min) In this paper, we utilized generative models, and reformulate it for problems in molecular dynamics (MD) simulation, by introducing an MD potential energy component to our generative model. By incorporating potential energy as calculated from TorchMD into a conditional generative framework, we attempt to construct a low-potential energy route of transformation between the helix~$\rightarrow$~coil structures of a protein. We show how to add an additional loss function to conditional generative models, motivated by potential energy of molecular configurations, and also present an optimization technique for such an augmented loss function. Our results show the benefit of this additional loss term on synthesizing realistic molecular trajectories.
    Integrating Human-in-the-loop into Swarm Learning for Decentralized Fake News Detection. (arXiv:2201.02048v1 [cs.LG])
    (2 min) Social media has become an effective platform to generate and spread fake news that can mislead people and even distort public opinion. Centralized methods for fake news detection, however, cannot effectively protect user privacy during the process of centralized data collection for training models. Moreover, it cannot fully involve user feedback in the loop of learning detection models for further enhancing fake news detection. To overcome these challenges, this paper proposed a novel decentralized method, Human-in-the-loop Based Swarm Learning (HBSL), to integrate user feedback into the loop of learning and inference for recognizing fake news without violating user privacy in a decentralized manner. It consists of distributed nodes that are able to independently learn and detect fake news on local data. Furthermore, detection models trained on these nodes can be enhanced through decentralized model merging. Experimental results demonstrate that the proposed method outperforms the state-of-the-art decentralized method in regard of detecting fake news on a benchmark dataset.
    An Evaluation Study of Generative Adversarial Networks for Collaborative Filtering. (arXiv:2201.01815v1 [cs.IR])
    (2 min) This work explores the reproducibility of CFGAN. CFGAN and its family of models (TagRec, MTPR, and CRGAN) learn to generate personalized and fake-but-realistic rankings of preferences for top-N recommendations by using previous interactions. This work successfully replicates the results published in the original paper and discusses the impact of certain differences between the CFGAN framework and the model used in the original evaluation. The absence of random noise and the use of real user profiles as condition vectors leaves the generator prone to learn a degenerate solution in which the output vector is identical to the input vector, therefore, behaving essentially as a simple autoencoder. The work further expands the experimental analysis comparing CFGAN against a selection of simple and well-known properly optimized baselines, observing that CFGAN is not consistently competitive against them despite its high computational cost. To ensure the reproducibility of these analyses, this work describes the experimental methodology and publishes all datasets and source code.
    Does entity abstraction help generative Transformers reason?. (arXiv:2201.01787v1 [cs.CL])
    (0 min) Pre-trained language models (LMs) often struggle to reason logically or generalize in a compositional fashion. Recent work suggests that incorporating external entity knowledge can improve LMs' abilities to reason and generalize. However, the effect of explicitly providing entity abstraction remains unclear, especially with recent studies suggesting that pre-trained LMs already encode some of that knowledge in their parameters. We study the utility of incorporating entity type abstractions into pre-trained Transformers and test these methods on four NLP tasks requiring different forms of logical reasoning: (1) compositional language understanding with text-based relational reasoning (CLUTRR), (2) abductive reasoning (ProofWriter), (3) multi-hop question answering (HotpotQA), and (4) conversational question answering (CoQA). We propose and empirically explore three ways to add such abstraction: (i) as additional input embeddings, (ii) as a separate sequence to encode, and (iii) as an auxiliary prediction task for the model. Overall, our analysis demonstrates that models with abstract entity knowledge performs better than without it. However, our experiments also show that the benefits strongly depend on the technique used and the task at hand. The best abstraction aware models achieved an overall accuracy of 88.8% and 91.8% compared to the baseline model achieving 62.3% and 89.8% on CLUTRR and ProofWriter respectively. In addition, abstraction-aware models showed improved compositional generalization in both interpolation and extrapolation settings. However, for HotpotQA and CoQA, we find that F1 scores improve by only 0.5% on average. Our results suggest that the benefit of explicit abstraction is significant in formally defined logical reasoning settings requiring many reasoning hops, but point to the notion that it is less beneficial for NLP tasks having less formal logical structure.
    Federated Learning of Molecular Properties with Graph Neural Networks in a Heterogeneous Setting. (arXiv:2109.07258v2 [cs.LG] UPDATED)
    (2 min) Chemistry research has both high material and computational costs to conduct experiments. Institutions thus consider chemical data to be valuable and there have been few efforts to construct large public datasets for machine learning. Another challenge is that different intuitions are interested in different classes of molecules, creating heterogeneous data that cannot be easily joined by conventional distributed training. In this work, we introduce federated heterogeneous molecular learning to address these challenges. Federated learning allows end-users to build a global model collaboratively while keeping the training data distributed over isolated clients. Due to the lack of related research, we first simulate a heterogeneous federated learning benchmark (FedChem) by jointly performing scaffold splitting and latent Dirichlet allocation on existing datasets for heterogeneously distributed client data. Our results on FedChem show that significant learning challenges arise when working with heterogeneous molecules across clients. We then propose a method to alleviate the problem, namely Federated Learning by Instance reweighTing (FLIT(+)). FLIT(+) can align the local training across heterogeneous clients by improving the performance for uncertain samples. Comprehensive experiments conducted on our new benchmark FedChem validate the advantages of this method over other federated learning schemes. FedChem should enable a new type of collaboration for improving AI in chemistry that mitigates concerns about valuable chemical data.
    Self-Supervised Beat Tracking in Musical Signals with Polyphonic Contrastive Learning. (arXiv:2201.01771v1 [cs.SD])
    (2 min) Annotating musical beats is a very long in tedious process. In order to combat this problem, we present a new self-supervised learning pretext task for beat tracking and downbeat estimation. This task makes use of Spleeter, an audio source separation model, to separate a song's drums from the rest of its signal. The first set of signals are used as positives, and by extension negatives, for contrastive learning pre-training. The drum-less signals, on the other hand, are used as anchors. When pre-training a fully-convolutional and recurrent model using this pretext task, an onset function is learned. In some cases, this function was found to be mapped to periodic elements in a song. We found that pre-trained models outperformed randomly initialized models when a beat tracking training set was extremely small (less than 10 examples). When that was not the case, pre-training led to a learning speed-up that caused the model to overfit to the training set. More generally, this work defines new perspectives in the realm of musical self-supervised learning. It is notably one of the first works to use audio source separation as a fundamental component of self-supervision.
    An Abstraction-Refinement Approach to Verifying Convolutional Neural Networks. (arXiv:2201.01978v1 [cs.LG])
    (2 min) Convolutional neural networks have gained vast popularity due to their excellent performance in the fields of computer vision, image processing, and others. Unfortunately, it is now well known that convolutional networks often produce erroneous results - for example, minor perturbations of the inputs of these networks can result in severe classification errors. Numerous verification approaches have been proposed in recent years to prove the absence of such errors, but these are typically geared for fully connected networks and suffer from exacerbated scalability issues when applied to convolutional networks. To address this gap, we present here the Cnn-Abs framework, which is particularly aimed at the verification of convolutional networks. The core of Cnn-Abs is an abstraction-refinement technique, which simplifies the verification problem through the removal of convolutional connections in a way that soundly creates an over-approximation of the original problem; and which restores these connections if the resulting problem becomes too abstract. Cnn-Abs is designed to use existing verification engines as a backend, and our evaluation demonstrates that it can significantly boost the performance of a state-of-the-art DNN verification engine, reducing runtime by 15.7% on average.
    Deep Learning Based Classification System For Recognizing Local Spinach. (arXiv:2201.02093v1 [cs.CV])
    (2 min) A deep learning model gives an incredible result for image processing by studying from the trained dataset. Spinach is a leaf vegetable that contains vitamins and nutrients. In our research, a Deep learning method has been used that can automatically identify spinach and this method has a dataset of a total of five species of spinach that contains 3785 images. Four Convolutional Neural Network (CNN) models were used to classify our spinach. These models give more accurate results for image classification. Before applying these models there is some preprocessing of the image data. For the preprocessing of data, some methods need to happen. Those are RGB conversion, filtering, resize & rescaling, and categorization. After applying these methods image data are pre-processed and ready to be used in the classifier algorithms. The accuracy of these classifiers is in between 98.68% - 99.79%. Among those models, VGG16 achieved the highest accuracy of 99.79%.
    Topological Representations of Local Explanations. (arXiv:2201.02155v1 [cs.LG])
    (2 min) Local explainability methods -- those which seek to generate an explanation for each prediction -- are becoming increasingly prevalent due to the need for practitioners to rationalize their model outputs. However, comparing local explainability methods is difficult since they each generate outputs in various scales and dimensions. Furthermore, due to the stochastic nature of some explainability methods, it is possible for different runs of a method to produce contradictory explanations for a given observation. In this paper, we propose a topology-based framework to extract a simplified representation from a set of local explanations. We do so by first modeling the relationship between the explanation space and the model predictions as a scalar function. Then, we compute the topological skeleton of this function. This topological skeleton acts as a signature for such functions, which we use to compare different explanation methods. We demonstrate that our framework can not only reliably identify differences between explainability techniques but also provides stable representations. Then, we show how our framework can be used to identify appropriate parameters for local explainability methods. Our framework is simple, does not require complex optimizations, and can be broadly applied to most local explanation methods. We believe the practicality and versatility of our approach will help promote topology-based approaches as a tool for understanding and comparing explanation methods.
    A Hybrid Quantum-Classical Neural Network Architecture for Binary Classification. (arXiv:2201.01820v1 [cs.LG])
    (2 min) Deep learning is one of the most successful and far-reaching strategies used in machine learning today. However, the scale and utility of neural networks is still greatly limited by the current hardware used to train them. These concerns have become increasingly pressing as conventional computers quickly approach physical limitations that will slow performance improvements in years to come. For these reasons, scientists have begun to explore alternative computing platforms, like quantum computers, for training neural networks. In recent years, variational quantum circuits have emerged as one of the most successful approaches to quantum deep learning on noisy intermediate scale quantum devices. We propose a hybrid quantum-classical neural network architecture where each neuron is a variational quantum circuit. We empirically analyze the performance of this hybrid neural network on a series of binary classification data sets using a simulated universal quantum computer and a state of the art universal quantum computer. On simulated hardware, we observe that the hybrid neural network achieves roughly 10% higher classification accuracy and 20% better minimization of cost than an individual variational quantum circuit. On quantum hardware, we observe that each model only performs well when the qubit and gate count is sufficiently small.
    Mixture of basis for interpretable continual learning with distribution shifts. (arXiv:2201.01853v1 [cs.LG])
    (2 min) Continual learning in environments with shifting data distributions is a challenging problem with several real-world applications. In this paper we consider settings in which the data distribution(task) shifts abruptly and the timing of these shifts are not known. Furthermore, we consider a semi-supervised task-agnostic setting in which the learning algorithm has access to both task-segmented and unsegmented data for offline training. We propose a novel approach called mixture of Basismodels (MoB) for addressing this problem setting. The core idea is to learn a small set of basis models and to construct a dynamic, task-dependent mixture of the models to predict for the current task. We also propose a new methodology to detect observations that are out-of-distribution with respect to the existing basis models and to instantiate new models as needed. We test our approach in multiple domains and show that it attains better prediction error than existing methods in most cases while using fewer models than other multiple model approaches. Moreover, we analyze the latent task representations learned by MoB and show that similar tasks tend to cluster in the latent space and that the latent representation shifts at the task boundaries when tasks are dissimilar.
    Introducing Randomized High Order Fuzzy Cognitive Maps as Reservoir Computing Models: A Case Study in Solar Energy and Load Forecasting. (arXiv:2201.02158v1 [cs.AI])
    (2 min) Fuzzy Cognitive Maps (FCMs) have emerged as an interpretable signed weighted digraph method consisting of nodes (concepts) and weights which represent the dependencies among the concepts. Although FCMs have attained considerable achievements in various time series prediction applications, designing an FCM model with time-efficient training method is still an open challenge. Thus, this paper introduces a novel univariate time series forecasting technique, which is composed of a group of randomized high order FCM models labeled R-HFCM. The novelty of the proposed R-HFCM model is relevant to merging the concepts of FCM and Echo State Network (ESN) as an efficient and particular family of Reservoir Computing (RC) models, where the least squares algorithm is applied to train the model. From another perspective, the structure of R-HFCM consists of the input layer, reservoir layer, and output layer in which only the output layer is trainable while the weights of each sub-reservoir components are selected randomly and keep constant during the training process. As case studies, this model considers solar energy forecasting with public data for Brazilian solar stations as well as Malaysia dataset, which includes hourly electric load and temperature data of the power supply company of the city of Johor in Malaysia. The experiment also includes the effect of the map size, activation function, the presence of bias and the size of the reservoir on the accuracy of R-HFCM method. The obtained results confirm the outperformance of the proposed R-HFCM model in comparison to the other methods. This study provides evidence that FCM can be a new way to implement a reservoir of dynamics in time series modelling.
    Deep Fusion of Lead-lag Graphs:Application to Cryptocurrencies. (arXiv:2201.02040v1 [cs.LG])
    (2 min) The study of time series has motivated many researchers, particularly on the area of multivariate-analysis. The study of co-movements and dependency between random variables leads us to develop metrics to describe existing connection between assets. The most commonly used are correlation and causality. Despite the growing literature, some connections remained still undetected. The objective of this paper is to propose a new representation learning algorithm capable to integrate synchronous and asynchronous relationships.
    Direct multi-modal inversion of geophysical logs using deep learning. (arXiv:2201.01871v1 [physics.geo-ph])
    (2 min) Geosteering of wells requires fast interpretation of geophysical logs which is a non-unique inverse problem. Current work presents a proof-of-concept approach to multi-modal probabilistic inversion of logs using a single evaluation of an artificial deep neural network (DNN). A mixture density DNN (MDN) is trained using the "multiple-trajectory-prediction" (MTP) loss functions, which avoids mode collapse typical for traditional MDNs, and allows multi-modal prediction ahead of data. The proposed approach is verified on the real-time stratigraphic inversion of gamma-ray logs. The multi-modal predictor outputs several likely inverse solutions/predictions, providing more accurate and realistic solutions compared to a deterministic regression using a DNN. For these likely stratigraphic curves, the model simultaneously predicts their probabilities, which are implicitly learned from the training geological data. The stratigraphy predictions and their probabilities obtained in milliseconds from the MDN can enable better real-time decisions under geological uncertainties.
    Treehouse: A Case For Carbon-Aware Datacenter Software. (arXiv:2201.02120v1 [cs.DC])
    (2 min) The end of Dennard scaling and the slowing of Moore's Law has put the energy use of datacenters on an unsustainable path. Datacenters are already a significant fraction of worldwide electricity use, with application demand scaling at a rapid rate. We argue that substantial reductions in the carbon intensity of datacenter computing are possible with a software-centric approach: by making energy and carbon visible to application developers on a fine-grained basis, by modifying system APIs to make it possible to make informed trade offs between performance and carbon emissions, and by raising the level of application programming to allow for flexible use of more energy efficient means of compute and storage. We also lay out a research agenda for systems software to reduce the carbon footprint of datacenter computing.
    Point-Set Kernel Clustering. (arXiv:2002.05815v2 [cs.LG] UPDATED)
    (2 min) Measuring similarity between two objects is the core operation in existing clustering algorithms in grouping similar objects into clusters. This paper introduces a new similarity measure called point-set kernel which computes the similarity between an object and a set of objects. The proposed clustering procedure utilizes this new measure to characterize every cluster grown from a seed object. We show that the new clustering procedure is both effective and efficient that enables it to deal with large scale datasets. In contrast, existing clustering algorithms are either efficient or effective. In comparison with the state-of-the-art density-peak clustering and scalable kernel k-means clustering, we show that the proposed algorithm is more effective and runs orders of magnitude faster when applying to datasets of millions of data points, on a commonly used computing machine.
    DeepMLS: Geometry-Aware Control Point Deformation. (arXiv:2201.01873v1 [cs.GR])
    (2 min) We introduce DeepMLS, a space-based deformation technique, guided by a set of displaced control points. We leverage the power of neural networks to inject the underlying shape geometry into the deformation parameters. The goal of our technique is to enable a realistic and intuitive shape deformation. Our method is built upon moving least-squares (MLS), since it minimizes a weighted sum of the given control point displacements. Traditionally, the influence of each control point on every point in space (i.e., the weighting function) is defined using inverse distance heuristics. In this work, we opt to learn the weighting function, by training a neural network on the control points from a single input shape, and exploit the innate smoothness of neural networks. Our geometry-aware control point deformation is agnostic to the surface representation and quality; it can be applied to point clouds or meshes, including non-manifold and disconnected surface soups. We show that our technique facilitates intuitive piecewise smooth deformations, which are well suited for manufactured objects. We show the advantages of our approach compared to existing surface and space-based deformation techniques, both quantitatively and qualitatively.
    Efficiently Disentangle Causal Representations. (arXiv:2201.01942v1 [cs.LG])
    (0 min) This paper proposes an efficient approach to learning disentangled representations with causal mechanisms based on the difference of conditional probabilities in original and new distributions. We approximate the difference with models' generalization abilities so that it fits in the standard machine learning framework and can be efficiently computed. In contrast to the state-of-the-art approach, which relies on the learner's adaptation speed to new distribution, the proposed approach only requires evaluating the model's generalization ability. We provide a theoretical explanation for the advantage of the proposed method, and our experiments show that the proposed technique is 1.9--11.0$\times$ more sample efficient and 9.4--32.4 times quicker than the previous method on various tasks. The source code is available at \url{https://github.com/yuanpeng16/EDCR}.
    Learning Optimal Antenna Tilt Control Policies: A Contextual Linear Bandit Approach. (arXiv:2201.02169v1 [cs.LG])
    (2 min) Controlling antenna tilts in cellular networks is imperative to reach an efficient trade-off between network coverage and capacity. In this paper, we devise algorithms learning optimal tilt control policies from existing data (in the so-called passive learning setting) or from data actively generated by the algorithms (the active learning setting). We formalize the design of such algorithms as a Best Policy Identification (BPI) problem in Contextual Linear Multi-Arm Bandits (CL-MAB). An arm represents an antenna tilt update; the context captures current network conditions; the reward corresponds to an improvement of performance, mixing coverage and capacity; and the objective is to identify, with a given level of confidence, an approximately optimal policy (a function mapping the context to an arm with maximal reward). For CL-MAB in both active and passive learning settings, we derive information-theoretical lower bounds on the number of samples required by any algorithm returning an approximately optimal policy with a given level of certainty, and devise algorithms achieving these fundamental limits. We apply our algorithms to the Remote Electrical Tilt (RET) optimization problem in cellular networks, and show that they can produce optimal tilt update policy using much fewer data samples than naive or existing rule-based learning algorithms.
    NumHTML: Numeric-Oriented Hierarchical Transformer Model for Multi-task Financial Forecasting. (arXiv:2201.01770v1 [cs.LG])
    (0 min) Financial forecasting has been an important and active area of machine learning research because of the challenges it presents and the potential rewards that even minor improvements in prediction accuracy or forecasting may entail. Traditionally, financial forecasting has heavily relied on quantitative indicators and metrics derived from structured financial statements. Earnings conference call data, including text and audio, is an important source of unstructured data that has been used for various prediction tasks using deep earning and related approaches. However, current deep learning-based methods are limited in the way that they deal with numeric data; numbers are typically treated as plain-text tokens without taking advantage of their underlying numeric structure. This paper describes a numeric-oriented hierarchical transformer model to predict stock returns, and financial risk using multi-modal aligned earnings calls data by taking advantage of the different categories of numbers (monetary, temporal, percentages etc.) and their magnitude. We present the results of a comprehensive evaluation of NumHTML against several state-of-the-art baselines using a real-world publicly available dataset. The results indicate that NumHTML significantly outperforms the current state-of-the-art across a variety of evaluation metrics and that it has the potential to offer significant financial gains in a practical trading context.
    Learning Visible Connectivity Dynamics for Cloth Smoothing. (arXiv:2105.10389v4 [cs.RO] UPDATED)
    (2 min) Robotic manipulation of cloth remains challenging for robotics due to the complex dynamics of the cloth, lack of a low-dimensional state representation, and self-occlusions. In contrast to previous model-based approaches that learn a pixel-based dynamics model or a compressed latent vector dynamics, we propose to learn a particle-based dynamics model from a partial point cloud observation. To overcome the challenges of partial observability, we infer which visible points are connected on the underlying cloth mesh. We then learn a dynamics model over this visible connectivity graph. Compared to previous learning-based approaches, our model poses strong inductive bias with its particle based representation for learning the underlying cloth physics; it is invariant to visual features; and the predictions can be more easily visualized. We show that our method greatly outperforms previous state-of-the-art model-based and model-free reinforcement learning methods in simulation. Furthermore, we demonstrate zero-shot sim-to-real transfer where we deploy the model trained in simulation on a Franka arm and show that the model can successfully smooth different types of cloth from crumpled configurations. Videos can be found on our project website.
    Explainable AI for engineering design: A unified approach of systems engineering and component-based deep learning. (arXiv:2108.13836v2 [cs.LG] UPDATED)
    (2 min) Data-driven models created by machine learning gain in importance in all fields of design and engineering. They have high potential to assists decision-makers in creating novel artefacts with a better performance and sustainability. However, limited generalization and the black-box nature of these models induce limited explainability and reusability. These drawbacks provide significant barriers retarding adoption in engineering design. To overcome this situation, we propose a component-based approach to create partial component models by machine learning (ML). This component-based approach aligns deep learning to systems engineering (SE). By means of the example of energy efficient building design, we first demonstrate generalization of the component-based method by accurately predicting the performance of designs with random structure different from training data. Second, we illustrate explainability by local sampling, sensitivity information and rules derived from low-depth decision trees and by evaluating this information from an engineering design perspective. The key for explainability is that activations at interfaces between the components are interpretable engineering quantities. In this way, the hierarchical component system forms a deep neural network (DNN) that directly integrates information for engineering explainability. The large range of possible configurations in composing components allows the examination of novel unseen design cases with understandable data-driven models. The matching of parameter ranges of components by similar probability distribution produces reusable, well-generalizing, and trustworthy models. The approach adapts the model structure to engineering methods of systems engineering and domain knowledge.
    Combining Reinforcement Learning and Inverse Reinforcement Learning for Asset Allocation Recommendations. (arXiv:2201.01874v1 [cs.LG])
    (2 min) We suggest a simple practical method to combine the human and artificial intelligence to both learn best investment practices of fund managers, and provide recommendations to improve them. Our approach is based on a combination of Inverse Reinforcement Learning (IRL) and RL. First, the IRL component learns the intent of fund managers as suggested by their trading history, and recovers their implied reward function. At the second step, this reward function is used by a direct RL algorithm to optimize asset allocation decisions. We show that our method is able to improve over the performance of individual fund managers.
    Uncertainty Quantification Techniques for Space Weather Modeling: Thermospheric Density Application. (arXiv:2201.02067v1 [cs.LG])
    (2 min) Machine learning (ML) has often been applied to space weather (SW) problems in recent years. SW originates from solar perturbations and is comprised of the resulting complex variations they cause within the systems between the Sun and Earth. These systems are tightly coupled and not well understood. This creates a need for skillful models with knowledge about the confidence of their predictions. One example of such a dynamical system is the thermosphere, the neutral region of Earth's upper atmosphere. Our inability to forecast it has severe repercussions in the context of satellite drag and collision avoidance operations for objects in low Earth orbit. Even with (assumed) perfect driver forecasts, our incomplete knowledge of the system results in often inaccurate neutral mass density predictions. Continuing efforts are being made to improve model accuracy, but density models rarely provide estimates of uncertainty. In this work, we propose two techniques to develop nonlinear ML models to predict thermospheric density while providing calibrated uncertainty estimates: Monte Carlo (MC) dropout and direct prediction of the probability distribution, both using the negative logarithm of predictive density (NLPD) loss function. We show the performance for models trained on local and global datasets. This shows that NLPD provides similar results for both techniques but the direct probability method has a much lower computational cost. For the global model regressed on the SET HASDM density database, we achieve errors of 11% on independent test data with well-calibrated uncertainty estimates. Using an in-situ CHAMP density dataset, both techniques provide test error on the order of 13%. The CHAMP models (on independent data) are within 2% of perfect calibration for all prediction intervals tested. This model can also be used to obtain global predictions with uncertainties at a given epoch.
    Multi-Label Classification on Remote-Sensing Images. (arXiv:2201.01971v1 [cs.CV])
    (2 min) Acquiring information on large areas on the earth's surface through satellite cameras allows us to see much more than we can see while standing on the ground. This assists us in detecting and monitoring the physical characteristics of an area like land-use patterns, atmospheric conditions, forest cover, and many unlisted aspects. The obtained images not only keep track of continuous natural phenomena but are also crucial in tackling the global challenge of severe deforestation. Among which Amazon basin accounts for the largest share every year. Proper data analysis would help limit detrimental effects on the ecosystem and biodiversity with a sustainable healthy atmosphere. This report aims to label the satellite image chips of the Amazon rainforest with atmospheric and various classes of land cover or land use through different machine learning and superior deep learning models. Evaluation is done based on the F2 metric, while for loss function, we have both sigmoid cross-entropy as well as softmax cross-entropy. Images are fed indirectly to the machine learning classifiers after only features are extracted using pre-trained ImageNet architectures. Whereas for deep learning models, ensembles of fine-tuned ImageNet pre-trained models are used via transfer learning. Our best score was achieved so far with the F2 metric is 0.927.
    SABLAS: Learning Safe Control for Black-box Dynamical Systems. (arXiv:2201.01918v1 [cs.LG])
    (2 min) Control certificates based on barrier functions have been a powerful tool to generate probably safe control policies for dynamical systems. However, existing methods based on barrier certificates are normally for white-box systems with differentiable dynamics, which makes them inapplicable to many practical applications where the system is a black-box and cannot be accurately modeled. On the other side, model-free reinforcement learning (RL) methods for black-box systems suffer from lack of safety guarantees and low sampling efficiency. In this paper, we propose a novel method that can learn safe control policies and barrier certificates for black-box dynamical systems, without requiring for an accurate system model. Our method re-designs the loss function to back-propagate gradient to the control policy even when the black-box dynamical system is non-differentiable, and we show that the safety certificates hold on the black-box system. Empirical results in simulation show that our method can significantly improve the performance of the learned policies by achieving nearly 100% safety and goal reaching rates using much fewer training samples, compared to state-of-the-art black-box safe control methods. Our learned agents can also generalize to unseen scenarios while keeping the original performance. The source code can be found at https://github.com/Zengyi-Qin/bcbf.
    Neural Architecture Search for Inversion. (arXiv:2201.01772v1 [cs.LG])
    (2 min) Over the year, people have been using deep learning to tackle inversion problems, and we see the framework has been applied to build relationship between recording wavefield and velocity (Yang et al., 2016). Here we will extend the work from 2 perspectives, one is deriving a more appropriate loss function, as we now, pixel-2-pixel comparison might not be the best choice to characterize image structure, and we will elaborate on how to construct cost function to capture high level feature to enhance the model performance. Another dimension is searching for the more appropriate neural architecture, which is a subset of an even bigger picture, the automatic machine learning, or AutoML. There are several famous networks, U-net, ResNet (He et al., 2016) and DenseNet (Huang et al., 2017), and they achieve phenomenal results for certain problems, yet it's hard to argue they are the best for inversion problems without thoroughly searching within certain space. Here we will be showing our architecture search results for inversion.
  • stat.ML updates on arXiv.org

    Machine Learning for Variance Reduction in Online Experiments. (arXiv:2106.07263v3 [stat.ML] UPDATED)
    (0 min) We consider the problem of variance reduction in randomized controlled trials, through the use of covariates correlated with the outcome but independent of the treatment. We propose a machine learning regression-adjusted treatment effect estimator, which we call MLRATE. MLRATE uses machine learning predictors of the outcome to reduce estimator variance. It employs cross-fitting to avoid overfitting biases, and we prove consistency and asymptotic normality under general conditions. MLRATE is robust to poor predictions from the machine learning step: if the predictions are uncorrelated with the outcomes, the estimator performs asymptotically no worse than the standard difference-in-means estimator, while if predictions are highly correlated with outcomes, the efficiency gains are large. In A/A tests, for a set of 48 outcome metrics commonly monitored in Facebook experiments the estimator has over 70% lower variance than the simple difference-in-means estimator, and about 19% lower variance than the common univariate procedure which adjusts only for pre-experiment values of the outcome.
    Point-Set Kernel Clustering. (arXiv:2002.05815v2 [cs.LG] UPDATED)
    (2 min) Measuring similarity between two objects is the core operation in existing clustering algorithms in grouping similar objects into clusters. This paper introduces a new similarity measure called point-set kernel which computes the similarity between an object and a set of objects. The proposed clustering procedure utilizes this new measure to characterize every cluster grown from a seed object. We show that the new clustering procedure is both effective and efficient that enables it to deal with large scale datasets. In contrast, existing clustering algorithms are either efficient or effective. In comparison with the state-of-the-art density-peak clustering and scalable kernel k-means clustering, we show that the proposed algorithm is more effective and runs orders of magnitude faster when applying to datasets of millions of data points, on a commonly used computing machine.
    HuSpaCy: an industrial-strength Hungarian natural language processing toolkit. (arXiv:2201.01956v1 [cs.CL])
    (2 min) Although there are a couple of open-source language processing pipelines available for Hungarian, none of them satisfies the requirements of today's NLP applications. A language processing pipeline should consist of close to state-of-the-art lemmatization, morphosyntactic analysis, entity recognition and word embeddings. Industrial text processing applications have to satisfy non-functional software quality requirements, what is more, frameworks supporting multiple languages are more and more favored. This paper introduces HuSpaCy, an industryready Hungarian language processing pipeline. The presented tool provides components for the most important basic linguistic analysis tasks. It is open-source and is available under a permissive license. Our system is built upon spaCy's NLP components which means that it is fast, has a rich ecosystem of NLP applications and extensions, comes with extensive documentation and a well-known API. Besides the overview of the underlying models, we also present rigorous evaluation on common benchmark datasets. Our experiments confirm that HuSpaCy has high accuracy in all subtasks while maintaining resource-efficient prediction capabilities.
    Randomized Spectral Clustering in Large-Scale Stochastic Block Models. (arXiv:2002.00839v3 [cs.SI] UPDATED)
    (2 min) Spectral clustering has been one of the widely used methods for community detection in networks. However, large-scale networks bring computational challenges to the eigenvalue decomposition therein. In this paper, we study the spectral clustering using randomized sketching algorithms from a statistical perspective, where we typically assume the network data are generated from a stochastic block model that is not necessarily of full rank. To do this, we first use the recently developed sketching algorithms to obtain two randomized spectral clustering algorithms, namely, the random projection-based and the random sampling-based spectral clustering. Then we study the theoretical bounds of the resulting algorithms in terms of the approximation error for the population adjacency matrix, the misclassification error, and the estimation error for the link probability matrix. It turns out that, under mild conditions, the randomized spectral clustering algorithms lead to the same theoretical bounds as those of the original spectral clustering algorithm. We also extend the results to degree-corrected stochastic block models. Numerical experiments support our theoretical findings and show the efficiency of randomized methods. A new R package called Rclust is developed and made available to the public.
    An Homogeneous Unbalanced Regularized Optimal Transport model with applications to Optimal Transport with Boundary. (arXiv:2201.02082v1 [math.OC])
    (2 min) This work studies how the introduction of the entropic regularization term in unbalanced Optimal Transport (OT) models may alter their homogeneity with respect to the input measures. We observe that in common settings (including balanced OT and unbalanced OT with Kullback-Leibler divergence to the marginals), although the optimal transport cost itself is not homogeneous, optimal transport plans and the so-called Sinkhorn divergences are indeed homogeneous. However, homogeneity does not hold in more general Unbalanced Regularized Optimal Transport (UROT) models, for instance those using the Total Variation as divergence to the marginals. We propose to modify the entropic regularization term to retrieve an UROT model that is homogeneous while preserving most properties of the standard UROT model. We showcase the importance of using our Homogeneous UROT (HUROT) model when it comes to regularize Optimal Transport with Boundary, a transportation model involving a spatially varying divergence to the marginals for which the standard (inhomogeneous) UROT model would yield inappropriate behavior.
    Efficient Vertex-Oriented Polytopic Projection for Web-scale Applications. (arXiv:2103.05277v3 [cs.AI] UPDATED)
    (2 min) We consider applications involving a large set of instances of projecting points to polytopes. We develop an intuition guided by theoretical and empirical analysis to show that when these instances follow certain structures, a large majority of the projections lie on vertices of the polytopes. To do these projections efficiently we derive a vertex-oriented incremental algorithm to project a point onto any arbitrary polytope, as well as give specific algorithms to cater to simplex projection and polytopes where the unit box is cut by planes. Such settings are especially useful in web-scale applications such as optimal matching or allocation problems. Several such problems in internet marketplaces (e-commerce, ride-sharing, food delivery, professional services, advertising, etc.), can be formulated as Linear Programs (LP) with such polytope constraints that require a projection step in the overall optimization process. We show that in the very recent work, the polytopic projection is the most expensive step and our efficient projection algorithms help in gaining massive improvements in performance.
    Double Double Descent: On Generalization Errors in Transfer Learning between Linear Regression Tasks. (arXiv:2006.07002v7 [cs.LG] UPDATED)
    (2 min) We study the transfer learning process between two linear regression problems. An important and timely special case is when the regressors are overparameterized and perfectly interpolate their training data. We examine a parameter transfer mechanism whereby a subset of the parameters of the target task solution are constrained to the values learned for a related source task. We analytically characterize the generalization error of the target task in terms of the salient factors in the transfer learning architecture, i.e., the number of examples available, the number of (free) parameters in each of the tasks, the number of parameters transferred from the source to target task, and the relation between the two tasks. Our non-asymptotic analysis shows that the generalization error of the target task follows a two-dimensional double descent trend (with respect to the number of free parameters in each of the tasks) that is controlled by the transfer learning factors. Our analysis points to specific cases where the transfer of parameters is beneficial as a substitute for extra overparameterization (i.e., additional free parameters in the target task). Specifically, we show that the usefulness of a transfer learning setting is fragile and depends on a delicate interplay among the set of transferred parameters, the relation between the tasks, and the true solution. We also demonstrate that overparameterized transfer learning is not necessarily more beneficial when the source task is closer or identical to the target task.
    Fixed points of nonnegative neural networks. (arXiv:2106.16239v4 [stat.ML] UPDATED)
    (2 min) We derive conditions for the existence of fixed points of nonnegative neural networks, an important research objective to understand the behavior of neural networks in modern applications involving autoencoders and loop unrolling techniques, among others. In particular, we show that neural networks with nonnegative inputs and nonnegative parameters can be recognized as monotonic and (weakly) scalable functions within the framework of nonlinear Perron-Frobenius theory. This fact enables us to derive conditions for the existence of a nonempty fixed point set of the nonnegative neural networks, and these conditions are weaker than those obtained recently using arguments in convex analysis, which are typically based on the assumption of nonexpansivity of the activation functions. Furthermore, we prove that the shape of the fixed point set of monotonic and weakly scalable neural networks is often an interval, which degenerates to a point for the case of scalable networks. The chief results of this paper are verified in numerical simulations, where we consider an autoencoder-type network that first compresses angular power spectra in massive MIMO systems, and, second, reconstruct the input spectra from the compressed signals.
    Disentangling homophily, community structure and triadic closure in networks. (arXiv:2101.02510v3 [cs.SI] UPDATED)
    (2 min) Network homophily, the tendency of similar nodes to be connected, and transitivity, the tendency of two nodes being connected if they share a common neighbor, are conflated properties in network analysis, since one mechanism can drive the other. Here we present a generative model and corresponding inference procedure that are capable of distinguishing between both mechanisms. Our approach is based on a variation of the stochastic block model (SBM) with the addition of triadic closure edges, and its inference can identify the most plausible mechanism responsible for the existence of every edge in the network, in addition to the underlying community structure itself. We show how the method can evade the detection of spurious communities caused solely by the formation of triangles in the network, and how it can improve the performance of edge prediction when compared to the pure version of the SBM without triadic closure.
    Cooperative learning for multi-view analysis. (arXiv:2112.12337v3 [stat.ME] UPDATED)
    (2 min) We propose a new method for supervised learning with multiple sets of features ("views"). Cooperative learning combines the usual squared error loss of predictions with an "agreement" penalty to encourage the predictions from different data views to agree. By varying the weight of the agreement penalty, we get a continuum of solutions that include the well-known early and late fusion approaches. Cooperative learning chooses the degree of agreement (or fusion) in an adaptive manner, using a validation set or cross-validation to estimate test set prediction error. One version of our fitting procedure is modular, where one can choose different fitting mechanisms (e.g. lasso, random forests, boosting, neural networks) appropriate for different data views. In the setting of cooperative regularized linear regression, the method combines the lasso penalty with the agreement penalty. The method can be especially powerful when the different data views share some underlying relationship in their signals that we aim to strengthen, while each view has its idiosyncratic noise that we aim to reduce. We illustrate the effectiveness of our proposed method on simulated and real data examples.
    Reliability Estimation of an Advanced Nuclear Fuel using Coupled Active Learning, Multifidelity Modeling, and Subset Simulation. (arXiv:2201.02172v1 [stat.AP])
    (2 min) Tristructural isotropic (TRISO)-coated particle fuel is a robust nuclear fuel and determining its reliability is critical for the success of advanced nuclear technologies. However, TRISO failure probabilities are small and the associated computational models are expensive. We used coupled active learning, multifidelity modeling, and subset simulation to estimate the failure probabilities of TRISO fuels using several 1D and 2D models. With multifidelity modeling, we replaced expensive high-fidelity (HF) model evaluations with information fusion from two low-fidelity (LF) models. For the 1D TRISO models, we considered three multifidelity modeling strategies: only Kriging, Kriging LF prediction plus Kriging correction, and deep neural network (DNN) LF prediction plus Kriging correction. While the results across these multifidelity modeling strategies compared satisfactorily, strategies employing information fusion from two LF models consistently called the HF model least often. Next, for the 2D TRISO model, we considered two multifidelity modeling strategies: DNN LF prediction plus Kriging correction (data-driven) and 1D TRISO LF prediction plus Kriging correction (physics-based). The physics-based strategy, as expected, consistently required the fewest calls to the HF model. However, the data-driven strategy had a lower overall simulation time since the DNN predictions are instantaneous, and the 1D TRISO model requires a non-negligible simulation time.
    Jointly Learning Environments and Control Policies with Projected Stochastic Gradient Ascent. (arXiv:2006.01738v4 [cs.LG] UPDATED)
    (2 min) We consider the joint design and control of discrete-time stochastic dynamical systems over a finite time horizon. We formulate the problem as a multi-step optimization problem under uncertainty seeking to identify a system design and a control policy that jointly maximize the expected sum of rewards collected over the time horizon considered. The transition function, the reward function and the policy are all parametrized, assumed known and differentiable with respect to their parameters. We then introduce a deep reinforcement learning algorithm combining policy gradient methods with model-based optimization techniques to solve this problem. In essence, our algorithm iteratively approximates the gradient of the expected return via Monte-Carlo sampling and automatic differentiation and takes projected gradient ascent steps in the space of environment and policy parameters. This algorithm is referred to as Direct Environment and Policy Search (DEPS). We assess the performance of our algorithm in three environments concerned with the design and control of a mass-spring-damper system, a small-scale off-grid power system and a drone, respectively. In addition, our algorithm is benchmarked against a state-of-the-art deep reinforcement learning algorithm used to tackle joint design and control problems. We show that DEPS performs at least as well or better in all three environments, consistently yielding solutions with higher returns in fewer iterations. Finally, solutions produced by our algorithm are also compared with solutions produced by an algorithm that does not jointly optimize environment and policy parameters, highlighting the fact that higher returns can be achieved when joint optimization is performed.
    Gaussian Imagination in Bandit Learning. (arXiv:2201.01902v1 [cs.LG])
    (2 min) Assuming distributions are Gaussian often facilitates computations that are otherwise intractable. We consider an agent who is designed to attain a low information ratio with respect to a bandit environment with a Gaussian prior distribution and a Gaussian likelihood function, but study the agent's performance when applied instead to a Bernoulli bandit. We establish a bound on the increase in Bayesian regret when an agent interacts with the Bernoulli bandit, relative to an information-theoretic bound satisfied with the Gaussian bandit. If the Gaussian prior distribution and likelihood function are sufficiently diffuse, this increase grows with the square-root of the time horizon, and thus the per-timestep increase vanishes. Our results formalize the folklore that so-called Bayesian agents remain effective when instantiated with diffuse misspecified distributions.
    Towards a theory of out-of-distribution learning. (arXiv:2109.14501v4 [stat.ML] UPDATED)
    (2 min) What is learning? 20$^{st}$ century formalizations of learning theory -- which precipitated revolutions in artificial intelligence -- focus primarily on $\mathit{in-distribution}$ learning, that is, learning under the assumption that the training data are sampled from the same distribution as the evaluation distribution. This assumption renders these theories inadequate for characterizing 21$^{st}$ century real world data problems, which are typically characterized by evaluation distributions that differ from the training data distributions (referred to as out-of-distribution learning). We therefore make a small change to existing formal definitions of learnability by relaxing that assumption. We then introduce $\mathbf{learning\ efficiency}$ (LE) to quantify the amount a learner is able to leverage data for a given problem, regardless of whether it is an in- or out-of-distribution problem. We then define and prove the relationship between generalized notions of learnability, and show how this framework is sufficiently general to characterize transfer, multitask, meta, continual, and lifelong learning. We hope this unification helps bridge the gap between empirical practice and theoretical guidance in real world problems. Finally, because biological learning continues to outperform machine learning algorithms on certain OOD challenges, we discuss the limitations of this framework vis-\'a-vis its ability to formalize biological learning, suggesting multiple avenues for future research.
    Discovering contemporaneous and lagged causal relations in autocorrelated nonlinear time series datasets. (arXiv:2003.03685v2 [stat.ME] UPDATED)
    (2 min) The paper introduces a novel conditional independence (CI) based method for linear and nonlinear, lagged and contemporaneous causal discovery from observational time series in the causally sufficient case. Existing CI-based methods such as the PC algorithm and also common methods from other frameworks suffer from low recall and partially inflated false positives for strong autocorrelation which is an ubiquitous challenge in time series. The novel method, PCMCI$^+$, extends PCMCI [Runge et al., 2019b] to include discovery of contemporaneous links. PCMCI$^+$ improves the reliability of CI tests by optimizing the choice of conditioning sets and even benefits from autocorrelation. The method is order-independent and consistent in the oracle case. A broad range of numerical experiments demonstrates that PCMCI$^+$ has higher adjacency detection power and especially more contemporaneous orientation recall compared to other methods while better controlling false positives. Optimized conditioning sets also lead to much shorter runtimes than the PC algorithm. PCMCI$^+$ can be of considerable use in many real world application scenarios where often time resolutions are too coarse to resolve time delays and strong autocorrelation is present.
    Online nonnegative CP-dictionary learning for Markovian data. (arXiv:2009.07612v3 [stat.ML] UPDATED)
    (2 min) Online Tensor Factorization (OTF) is a fundamental tool in learning low-dimensional interpretable features from streaming multi-modal data. While various algorithmic and theoretical aspects of OTF have been investigated recently, a general convergence guarantee to stationary points of the objective function without any incoherence or sparsity assumptions is still lacking even for the i.i.d. case. In this work, we introduce a novel algorithm that learns a CANDECOMP/PARAFAC (CP) basis from a given stream of tensor-valued data under general constraints, including nonnegativity constraints that induce interpretability of the learned CP basis. We prove that our algorithm converges almost surely to the set of stationary points of the objective function under the hypothesis that the sequence of data tensors is generated by an underlying Markov chain. Our setting covers the classical i.i.d. case as well as a wide range of application contexts including data streams generated by independent or MCMC sampling. Our result closes a gap between OTF and Online Matrix Factorization in global convergence analysis \commHL{for CP-decompositions}. Experimentally, we show that our algorithm converges much faster than standard algorithms for nonnegative tensor factorization tasks on both synthetic and real-world data. Also, we demonstrate the utility of our algorithm on a diverse set of examples from an image, video, and time-series data, illustrating how one may learn qualitatively different CP-dictionaries from the same tensor data by exploiting the tensor structure in multiple ways.
    Federated Optimization of Smooth Loss Functions. (arXiv:2201.01954v1 [cs.LG])
    (2 min) In this work, we study empirical risk minimization (ERM) within a federated learning framework, where a central server minimizes an ERM objective function using training data that is stored across $m$ clients. In this setting, the Federated Averaging (FedAve) algorithm is the staple for determining $\epsilon$-approximate solutions to the ERM problem. Similar to standard optimization algorithms, the convergence analysis of FedAve only relies on smoothness of the loss function in the optimization parameter. However, loss functions are often very smooth in the training data too. To exploit this additional smoothness, we propose the Federated Low Rank Gradient Descent (FedLRGD) algorithm. Since smoothness in data induces an approximate low rank structure on the loss function, our method first performs a few rounds of communication between the server and clients to learn weights that the server can use to approximate clients' gradients. Then, our method solves the ERM problem at the server using inexact gradient descent. To show that FedLRGD can have superior performance to FedAve, we present a notion of federated oracle complexity as a counterpart to canonical oracle complexity. Under some assumptions on the loss function, e.g., strong convexity in parameter, $\eta$-H\"older smoothness in data, etc., we prove that the federated oracle complexity of FedLRGD scales like $\phi m(p/\epsilon)^{\Theta(d/\eta)}$ and that of FedAve scales like $\phi m(p/\epsilon)^{3/4}$ (neglecting sub-dominant factors), where $\phi\gg 1$ is a "communication-to-computation ratio," $p$ is the parameter dimension, and $d$ is the data dimension. Then, we show that when $d$ is small and the loss function is sufficiently smooth in the data, FedLRGD beats FedAve in federated oracle complexity. Finally, in the course of analyzing FedLRGD, we also establish a result on low rank approximation of latent variable models.
    Robust Linear Predictions: Analyses of Uniform Concentration, Fast Rates and Model Misspecification. (arXiv:2201.01973v1 [stat.ML])
    (2 min) The problem of linear predictions has been extensively studied for the past century under pretty generalized frameworks. Recent advances in the robust statistics literature allow us to analyze robust versions of classical linear models through the prism of Median of Means (MoM). Combining these approaches in a piecemeal way might lead to ad-hoc procedures, and the restricted theoretical conclusions that underpin each individual contribution may no longer be valid. To meet these challenges coherently, in this study, we offer a unified robust framework that includes a broad variety of linear prediction problems on a Hilbert space, coupled with a generic class of loss functions. Notably, we do not require any assumptions on the distribution of the outlying data points ($\mathcal{O}$) nor the compactness of the support of the inlying ones ($\mathcal{I}$). Under mild conditions on the dual norm, we show that for misspecification level $\epsilon$, these estimators achieve an error rate of $O(\max\left\{|\mathcal{O}|^{1/2}n^{-1/2}, |\mathcal{I}|^{1/2}n^{-1} \right\}+\epsilon)$, matching the best-known rates in literature. This rate is slightly slower than the classical rates of $O(n^{-1/2})$, indicating that we need to pay a price in terms of error rates to obtain robust estimates. Additionally, we show that this rate can be improved to achieve so-called ``fast rates" under additional assumptions.
    The dynamics of representation learning in shallow, non-linear autoencoders. (arXiv:2201.02115v1 [stat.ML])
    (2 min) Autoencoders are the simplest neural network for unsupervised learning, and thus an ideal framework for studying feature learning. While a detailed understanding of the dynamics of linear autoencoders has recently been obtained, the study of non-linear autoencoders has been hindered by the technical difficulty of handling training data with non-trivial correlations - a fundamental prerequisite for feature extraction. Here, we study the dynamics of feature learning in non-linear, shallow autoencoders. We derive a set of asymptotically exact equations that describe the generalisation dynamics of autoencoders trained with stochastic gradient descent (SGD) in the limit of high-dimensional inputs. These equations reveal that autoencoders learn the leading principal components of their inputs sequentially. An analysis of the long-time dynamics explains the failure of sigmoidal autoencoders to learn with tied weights, and highlights the importance of training the bias in ReLU autoencoders. Building on previous results for linear networks, we analyse a modification of the vanilla SGD algorithm which allows learning of the exact principal components. Finally, we show that our equations accurately describe the generalisation dynamics of non-linear autoencoders on realistic datasets such as CIFAR10.
    Efficiently Disentangle Causal Representations. (arXiv:2201.01942v1 [cs.LG])
    (2 min) This paper proposes an efficient approach to learning disentangled representations with causal mechanisms based on the difference of conditional probabilities in original and new distributions. We approximate the difference with models' generalization abilities so that it fits in the standard machine learning framework and can be efficiently computed. In contrast to the state-of-the-art approach, which relies on the learner's adaptation speed to new distribution, the proposed approach only requires evaluating the model's generalization ability. We provide a theoretical explanation for the advantage of the proposed method, and our experiments show that the proposed technique is 1.9--11.0$\times$ more sample efficient and 9.4--32.4 times quicker than the previous method on various tasks. The source code is available at \url{https://github.com/yuanpeng16/EDCR}.
    Yield Spread Selection in Predicting Recession Probabilities: A Machine Learning Approach. (arXiv:2101.09394v2 [econ.EM] UPDATED)
    (2 min) The literature on using yield curves to forecast recessions customarily uses 10-year--three-month Treasury yield spread without verification on the pair selection. This study investigates whether the predictive ability of spread can be improved by letting a machine learning algorithm identify the best maturity pair and coefficients. Our comprehensive analysis shows that, despite the likelihood gain, the machine learning approach does not significantly improve prediction, owing to the estimation error. This is robust to the forecasting horizon, control variable, sample period, and oversampling of the recession observations. Our finding supports the use of the 10-year--three-month spread.
    Representation Learning for Online and Offline RL in Low-rank MDPs. (arXiv:2110.04652v3 [cs.LG] UPDATED)
    (2 min) This work studies the question of Representation Learning in RL: how can we learn a compact low-dimensional representation such that on top of the representation we can perform RL procedures such as exploration and exploitation, in a sample efficient manner. We focus on the low-rank Markov Decision Processes (MDPs) where the transition dynamics correspond to a low-rank transition matrix. Unlike prior works that assume the representation is known (e.g., linear MDPs), here we need to learn the representation for the low-rank MDP. We study both the online RL and offline RL settings. For the online setting, operating with the same computational oracles used in FLAMBE (Agarwal et.al), the state-of-art algorithm for learning representations in low-rank MDPs, we propose an algorithm REP-UCB Upper Confidence Bound driven Representation learning for RL), which significantly improves the sample complexity from $\widetilde{O}( A^9 d^7 / (\epsilon^{10} (1-\gamma)^{22}))$ for FLAMBE to $\widetilde{O}( A^2 d^4 / (\epsilon^2 (1-\gamma)^{5}) )$ with $d$ being the rank of the transition matrix (or dimension of the ground truth representation), $A$ being the number of actions, and $\gamma$ being the discounted factor. Notably, REP-UCB is simpler than FLAMBE, as it directly balances the interplay between representation learning, exploration, and exploitation, while FLAMBE is an explore-then-commit style approach and has to perform reward-free exploration step-by-step forward in time. For the offline RL setting, we develop an algorithm that leverages pessimism to learn under a partial coverage condition: our algorithm is able to compete against any policy as long as it is covered by the offline distribution.
    CausalSim: Toward a Causal Data-Driven Simulator for Network Protocols. (arXiv:2201.01811v1 [cs.LG])
    (2 min) Evaluating the real-world performance of network protocols is challenging. Randomized control trials (RCT) are expensive and inaccessible to most researchers, while expert-designed simulators fail to capture complex behaviors in real networks. We present CausalSim, a data-driven simulator for network protocols that addresses this challenge. Learning network behavior from observational data is complicated due to the bias introduced by the protocols used during data collection. CausalSim uses traces from an initial RCT under a set of protocols to learn a causal network model, effectively removing the biases present in the data. Using this model, CausalSim can then simulate any protocol over the same traces (i.e., for counterfactual predictions). Key to CausalSim is the novel use of adversarial neural network training that exploits distributional invariances that are present due to the training data coming from an RCT. Our extensive evaluation of CausalSim on both real and synthetic datasets and two use cases, including more than nine months of real data from the Puffer video streaming system, shows that it provides accurate counterfactual predictions, reducing prediction error by 44% and 53% on average compared to expert-designed and standard supervised learning baselines.
    A note on efficient minimum cost adjustment sets in causal graphical models. (arXiv:2201.02037v1 [math.ST])
    (2 min) We study the selection of adjustment sets for estimating the interventional mean under an individualized treatment rule. We assume a non-parametric causal graphical model with, possibly, hidden variables and at least one adjustment set comprised of observable variables. Moreover, we assume that observable variables have positive costs associated with them. We define the cost of an observable adjustment set as the sum of the costs of the variables that comprise it. We show that in this setting there exist adjustment sets that are minimum cost optimal, in the sense that they yield non-parametric estimators of the interventional mean with the smallest asymptotic variance among those that control for observable adjustment sets that have minimum cost. Our results are based on the construction of a special flow network associated with the original causal graph. We show that a minimum cost optimal adjustment set can be found by computing a maximum flow on the network, and then finding the set of vertices that are reachable from the source by augmenting paths. The optimaladj Python package implements the algorithms introduced in this paper.
    Group structure estimation for panel data -- a general approach. (arXiv:2201.01793v1 [stat.ME])
    (2 min) Consider a panel data setting where repeated observations on individuals are available. Often it is reasonable to assume that there exist groups of individuals that share similar effects of observed characteristics, but the grouping is typically unknown in advance. We propose a novel approach to estimate such unobserved groupings for general panel data models. Our method explicitly accounts for the uncertainty in individual parameter estimates and remains computationally feasible with a large number of individuals and/or repeated measurements on each individual. The developed ideas can be applied even when individual-level data are not available and only parameter estimates together with some quantification of uncertainty are given to the researcher.
    AIR-Net: Adaptive and Implicit Regularization Neural Network for Matrix Completion. (arXiv:2110.07557v3 [cs.LG] UPDATED)
    (2 min) The explicit low-rank regularization, e.g., nuclear norm regularization, has been widely used in imaging sciences. However, it has been found that implicit regularization outperforms explicit ones in various image processing tasks. Another issue is that the fixed explicit regularization limits the applicability to broad kinds of images since different images favor different features captured by using different explicit regularizations. As such, this paper proposes a new adaptive and implicit low-rank regularization that captures the low-rank prior dynamically from the training data. At the core of our new adaptive and implicit low-rank regularization is parameterizing the Laplacian matrix in the Dirichlet energy-based regularization with a neural network, and we call the proposed model \textit{AIR-Net}. Theoretically, we show that the adaptive regularization of AIR-Net enhances the implicit regularization and vanishes at the end of training. We validate AIR-Net's effectiveness on various benchmark tasks, indicating that the AIR-Net is particularly favorable for the scenarios when the missing entries are non-uniform. The code can be found at \href{https://github.com/lizhemin15/AIR-Net}{https://github.com/lizhemin15/AIR-Net}.
    Learning in Markov Decision Processes under Constraints. (arXiv:2002.12435v5 [cs.LG] UPDATED)
    (2 min) We consider reinforcement learning (RL) in Markov Decision Processes in which an agent repeatedly interacts with an environment that is modeled by a controlled Markov process. At each time step $t$, it earns a reward, and also incurs a cost-vector consisting of $M$ costs. We design model-based RL algorithms that maximize the cumulative reward earned over a time horizon of $T$ time-steps, while simultaneously ensuring that the average values of the $M$ cost expenditures are bounded by agent-specified thresholds $c^{ub}_i,i=1,2,\ldots,M$. In order to measure the performance of a reinforcement learning algorithm that satisfies the average cost constraints, we define an $M+1$ dimensional regret vector that is composed of its reward regret, and $M$ cost regrets. The reward regret measures the sub-optimality in the cumulative reward, while the $i$-th component of the cost regret vector is the difference between its $i$-th cumulative cost expense and the expected cost expenditures $Tc^{ub}_i$. We prove that the expected value of the regret vector of UCRL-CMDP, is upper-bounded as $\tilde{O}\left(T^{2\slash 3}\right)$, where $T$ is the time horizon. We further show how to reduce the regret of a desired subset of the $M$ costs, at the expense of increasing the regrets of rewards and the remaining costs. To the best of our knowledge, ours is the only work that considers non-episodic RL under average cost constraints, and derive algorithms that can~\emph{tune the regret vector} according to the agent's requirements on its cost regrets.
  • John D. Cook

    Three paths converge
    (2 min) When does the equation x2 + 7 = 2n have integer solutions? This is an interesting question, but why would anyone ask it? This post looks at three paths that have led to this problem. Ramanujan Ramanujan [1] considered this problem in 1913. He found five solutions and conjectured that there were no more. Then […] Three paths converge first appeared on John D. Cook.
  • Artificial Intelligence

    Coqui Introduces ‘YourTTS’: A Zero-Sample Text-to-Speech Model With State-of-The-Art (SOTA) Results
    (1 min) Recent advancements in end-to-end deep learning models have enabled new and intriguing Text-to-Speech (TTS) use-cases with excellent natural-sounding outcomes. However, the majority of these models are trained on large datasets recorded with a single speaker in a professional setting. Expanding solutions to numerous languages and speakers is not viable for everyone in this situation. It is more challenging for low-resource languages not often studied by mainstream research. Coqui’s team has designed ‘YourTTS‘ to overcome these limits and provide zero-shot TTS to low-resource languages. It can synthesize voices in various languages and drastically reduce data requirements by transferring information between the training set. Continue Reading.... Paper: https://arxiv.org/pdf/2112.02418.pdf Github: https://github.com/coqui-ai/TTS submitted by /u/ai-lover [link] [comments]
    ThisLocationDoesNotExist
    (1 min) ​ https://preview.redd.it/d3cvvcxo1ia81.png?width=947&format=png&auto=webp&s=af6454b6f683a410f2c4b0be0fc52c1cd307ae63 Satellite images can be easily re-purposed by using GANs that can synthesize fake aerial images. Check this out: https://mayachitra-thislocationdoesnotexist.azurewebsites.net/ PS: GANs are getting stronger day by day. This post is just one of the million examples out there! submitted by /u/ArmadilloFabulous774 [link] [comments]
    Eric Jang on Robots Learning at Google and Generalization via Language
    (0 min) submitted by /u/regalalgorithm [link] [comments]
    What is the best AI for making face edits?
    (0 min) submitted by /u/xXNOdrugsForMEXx [link] [comments]
    Biomedical relationship extraction for knowledge graph creation using machine learning
    (0 min) submitted by /u/dreadknight011 [link] [comments]
    Coursera Plus Annual Subscription is available at $100 discount until 1/13 if you fancy
    (0 min) submitted by /u/biggbrother23 [link] [comments]
    Yale University and IBM Researchers Introduce Kernel Graph Neural Networks (KerGNNs)
    (1 min) Graph kernel approaches have typically been the most popular strategy for graph classification tasks. Graph kernels can be thought of as functions that measure the similarity of two graphs. They allow kernelized learning algorithms like support vector machines to work directly on charts rather than convert them to fixed-length, real-valued feature vectors through feature extraction. In recent years, the use of Graph Neural Networks (GNNs) based on high-performance message-passing neural networks has exploded (MPNNs). As a result, they’ve grown increasingly popular for graph categorization. However, their performance is limited by their hand-crafted combinatorial features. Yale University and IBM researchers propose Kernel Graph Neural Networks (KerGNNs). KerGNNs are frameworks that combine graph kernels and the GNN message-passing procedure into one. They achieve results that are equivalent to cutting-edge approaches. Simultaneously, they vastly increase model interpretability when compared to traditional GNNs. Continue Reading.... Paper: https://arxiv.org/pdf/2201.00491.pdf ​ https://preview.redd.it/7e9twrtvpea81.png?width=1920&format=png&auto=webp&s=c96b50dac4f48ac18c913bbd0f1cf9a4cd0defc8 submitted by /u/ai-lover [link] [comments]
    Amazon Original Upload: Are we Inching Towards this Dystopian Reality with Metaverse?
    (0 min) submitted by /u/yadavvenugopal [link] [comments]
  • Machine Learning

    [D] Eric Jang on Robots Learning at Google and Generalization via Language
    (1 min) Hi there, just want to share out latest Gradient interview with Eric Jang, a research scientist on Google AI's Robotics team. He has worked a good deal with they arm farm, and more recently on imitation learning and reinforcement learning for enabling robots to generalize to many tasks. Here's the link: Eric Jang on Robots Learning at Google and Generalization via Language As usual, we get pretty technical and most of it going over his research, as well as his recent blog posts. Sections: (00:00) Intro (00:50) Start in AI / Research (03:58) Joining Google Robotics (10:08) End to End Learning of Semantic Grasping (19:11) Off Policy RL for Robotic Grasping (29:33) Grasp2Vec (40:50) Watch, Try, Learn Meta-Learning from Demonstrations and Rewards (50:12) BC-Z: Zero-Shot Task Generalization with Robotic Imitation Learning (59:41) Just Ask for Generalization (01:09:02) Data for Robotics (01:22:10) To Understand Language is to Understand Generalization (01:32:38) Outro Papers discussed: Grasp2Vec: Learning Object Representations from Self-Supervised Grasping End-to-End Learning of Semantic Grasping Deep reinforcement learning for vision-based robotic grasping: A simulated comparative evaluation of off-policy methods Watch, Try, Learn Meta-Learning from Demonstrations and Rewards BC-Z: Zero-Shot Task Generalization with Robotic Imitation Learning Just Ask for Generalization To Understand Language is to Understand Generalization Robots Must Be Ephemeralized submitted by /u/regalalgorithm [link] [comments]
    Hypothetical: testing a spiked neural network in an android [D] [R] [P]
    (1 min) I am working on a hard sci fi project set several hundred years in the future. I'd like to show a data scientist running realistic (but interesting) tests on an android to see if it's core SNN is intact. Ideally, this would be some kind of behavioral simulation, a series of tests that the android has to "pass." Thought I should come to the experts for suggestions! submitted by /u/Sensitive_Necessary7 [link] [comments]
    [D] How does 10-shot work?
    (0 min) I am trying to get into contrastive learning, and I have stumbled upon benchmarks like this: https://paperswithcode.com/sota/few-shot-image-classification-on-imagenet-10 What exactly is meant by 10-shot? Is it just training on 10 random examples (selected beforehand for fairness)? How does it work? submitted by /u/AICoderGamer [link] [comments]
    [P] Weekly updated open sourced model implementations in Flax
    (1 min) Even though the popularity of JAX is increasing everyday, open sourced code implementations are still hard to find. To counter this, I have created a repository with the goal to achieve something similar to timm for torch (except for the pretrained models because I don't have the funds currently). Some features of jax_models: 1. Pip installable 2. Weekly / fortnightly model additions 3. Comprehensive tests 4. Model architectures are cross checked by checking original implementations if available or otherwise miscellaneous aspects of the paper such as model parameters The repository is currently in its pre-alpha stage with only around 6 model and layer implementations. If you are interested in this, you can contribute by either asking for specific paper implementations or by helping me train and open source weights of these models. If you like my work then please consider starring the repository and giving feedback. Thank you! Link: https://github.com/DarshanDeshpande/jax-models submitted by /u/Megixist [link] [comments]
    [R] A 5 million source code file dataset
    (1 min) I published a dataset of 5m source code files from 15k open source files. Its long term goal is to enable identifying causality in software engineering. Describing paper: End to End Software Engineering Research People from ML, NLP, causality, and software engineering, might find it interesting. The dataset enables investigating code similarity (and also text similarity in English), program difficulty, defect predictions, etc. The code is extracted every two months in order to investigate the difference. By the difference one can investigate if a change in a possible cause (e.g., smell removal) leads to influence (less bugs)and in which context. I plan to keep extending the dataset and would like to get feedback on it - ease of use, new use cases, etc. If you have related dataset that can be merged with, related data that you would like to obtain or ideas for research directions, please contact me. submitted by /u/idan_huji [link] [comments]
    [R] The PyMC devs have made their books available free online!
    (1 min) Forward: "Bayesian modeling provides an elegant approach to many data science and decision-making problems. However, it can be hard to make it work well in practice. In particular, although there are many software packages that make it easy to specify complex hierarchical models such as Stan, PyMC3, TensorFlow Probability (TFP), and Pyro, users still need additional tools to diagnose whether the results of their computations are correct or not. They may also need advice on what to do when things do go wrong. This book focuses on the ArviZ library, which enables users to perform exploratory analysis of Bayesian models, for example, diagnostics of posterior samples generated by any inference method. This can be used to diagnose a variety of failure modes in Bayesian inference. The book also discusses various modeling strategies (such as centering) that can be employed to eliminate many of the most common problems. Most of the examples in the book use PyMC3, although some also use TFP; a brief comparison of other probabilistic programming languages is also included. The authors are all experts in the area of Bayesian software and are major contributors to the PyMC3, ArviZ, and TFP libraries. They also have significant experience applying Bayesian data analysis in practice, and this is reflected in the practical approach adopted in this book. Overall, I think this is a valuable addition to the literature, which should hopefully further the adoption of Bayesian methods." Link: https://bayesiancomputationbook.com/welcome.html submitted by /u/bikeskata [link] [comments]
    ‎StyleGAN 2 and PSP running on Apple Neural Engine [P]
    (1 min) submitted by /u/surelyouarejoking [link] [comments]
    [D] Fourier transform vs NNs as function approximators
    (6 min) So this is probably a basic question. If the main premise of neural networks is that they are global function approximators, what advantage do they have against other approximators such Fourier transform, which is also proven to be able to approximate any function. Why does not the whole supervised learning field become one of calculating Fourier coefficients submitted by /u/Hazalem [link] [comments]
    [D] How often is the case that a conference submission is rejected everywhere?
    (3 min) My understanding is that if your research paper is denied from a conference, you would take the reviewer's feedback, make some changes, then submit to the next big ML conference. The process repeats until the paper is eventually accepted somewhere. But how often is it the case where the paper isn't accepted anywhere? How many papers are accepted on the first (or second) try? I'm an undergrad who's really new with the concept of publishing papers, so I thought it's worth asking around here. :) submitted by /u/akardashian [link] [comments]
    [D] Is there a complete list of modern different AI algorithms and its main application?
    (1 min) Like DNN for all domains, Reinforcement learning for? Genetic algorithm for ? Symbolic AI for ? ... submitted by /u/ghosthamlet [link] [comments]
    [D] Questions about paper "Pay Attention to MLPs"
    (1 min) In Pay Attention to MLPs, the authors said that it is necessary for the layer s(') to contain a contraction operation over the spatial dimension. It's hard to understand why s(') should be a contraction operation. I think s(') should maintain the spatial dimension for proper calculation. Why should s(') be a contraction operation? Can I say transformer dynamically parameterize spatial interactions with positional encoder? I am confused whether spatial interaction already contains the meaning of positional encoding. What do you think? Thank you in advance. submitted by /u/Spiritual_Fig3632 [link] [comments]
    Can we use data available for "non-commercial use" for training a DL model which will be deployed as a subsystem in a commercial product? [D]
    (3 min) I am having a lot of difficulties getting the right data for my task since most of the datasets out there including ImageNet restrict the usage of the data to only educational and non-commercial research. In this case, should I reach out to them to get usage permission? What are the chances that they will allow/respond and would there be a significant cost associated with this? Thanks! submitted by /u/PinPitiful [link] [comments]
  • Neural Networks, Deep Learning and Machine Learning

    A Neural Network Solves and Generates Mathematics Problems by Program Synthesis: Calculus, Differential Equations, Linear Algebra, and More
    (1 min) submitted by /u/nickb [link] [comments]
    Optimization of the neural network with the use of an algorithm simulated annealing
    (1 min) Hi, Have any of you tried to optimize the neural network using the simulated annealing algorithm? I am referring to an example even on a simple neural network because I have a similar project and I will consider how to get started with the topic, so practical use would be extremely helpful ... Additionally, some scientific papers or hypothetical examples of use may also be helpful. Thanks in advance for your help. submitted by /u/theGrEaTmPm [link] [comments]
    Newbie needs help generating data charts for pattern recognition
    (1 min) Hello guys, I am a newbie to neural nets. I need your help I have 6 data patterns in a quality control chart and I need from those 6 graphs to generate more of each of them in order to use them to train a neural net do identify them in random charts. I have tried using Gaussian distribution noise but something doesn't work well. I prefer using python to generate them. submitted by /u/aherontas [link] [comments]
    Yale University and IBM Researchers Introduce Kernel Graph Neural Networks (KerGNNs)
    (1 min) Graph kernel approaches have typically been the most popular strategy for graph classification tasks. Graph kernels can be thought of as functions that measure the similarity of two graphs. They allow kernelized learning algorithms like support vector machines to work directly on charts rather than convert them to fixed-length, real-valued feature vectors through feature extraction. In recent years, the use of Graph Neural Networks (GNNs) based on high-performance message-passing neural networks has exploded (MPNNs). As a result, they’ve grown increasingly popular for graph categorization. However, their performance is limited by their hand-crafted combinatorial features. Yale University and IBM researchers propose Kernel Graph Neural Networks (KerGNNs). KerGNNs are frameworks that combine graph kernels and the GNN message-passing procedure into one. They achieve results that are equivalent to cutting-edge approaches. Simultaneously, they vastly increase model interpretability when compared to traditional GNNs. Continue Reading.... Paper: https://arxiv.org/pdf/2201.00491.pdf https://preview.redd.it/pe289oqwpea81.png?width=1920&format=png&auto=webp&s=31c2a51a4f57303dec5f41680a9e4517f4f662be submitted by /u/ai-lover [link] [comments]
    Image GANs meet Differentiable Rendering for Inverse Graphics and Interpretable 3D Neural Rendering (aka 3D objects from a single image)
    (1 min) submitted by /u/nickb [link] [comments]
  • Reinforcement Learning

    Hiring: My team is hiring an 2 interns at Intel. Read post description
    (2 min) Looking for 2 interns for summer. There are few pre-reqs. Must be in US. Must enroll in a Ph.D program (Sorry no MS and BS students) Must have a track record for doing RL research, i.e 2 to 3 publications at least. If you are interested, please send me a link to your resume. Thanks Edit: hiring 2 interns submitted by /u/schrodingershit [link] [comments]
    Offline RL using Policy Gradients
    (1 min) I have a dataset of events that take place in football (soccer) games. This is being framed as a reinforcement learning issue by defining episodes to be the events (pass, shoot, dribble, etc) that take place when a team gains possession and loses it (by shooting, losing possession, etc). I have created a simple NN that estimates what the next action will be from the possible action space as a probability distribution. This NN was trained on the historical data from past matches. I would like to know if there are any good pointers into implementing Policy Gradients to nudge the probability distribution learned by the NN in the direction of some reward function that I define. Ideally the advice would be for tensorflow as that is what I am familiar with. submitted by /u/uom_questions [link] [comments]
    Can a supervised ANN detect a cat and a dog or just a cat?
    (2 min) So a NN is good at scanning a thousand labelled pictures of a cat and then when it see's a cat it has a high likelihood of knowing it's a cat (if it is trained right). Can that same nn (with same or similar configuration) also determine a cat from a dog from a range of random animals, assuming weights were adjusted and training redone? Would I need two NN's for that? Does first NN think it's a cat? No? Okay - does 2nd NN think it's a dog? Yes! submitted by /u/Togfox [link] [comments]
    RL (finance focused) Study Group
    (1 min) Hey everyone - I am looking for people (preferably MSc or PhD students/ grads) who are interested to learn applied RL in finance. I'm looking to create a group of 5-8 people to meet weekly or biweekly for around 1-2hrs to discuss papers and potential projects that we can collaborate on. Please let me know if interested. submitted by /u/ivan_426 [link] [comments]

2022-01-07

  • cs.LG updates on arXiv.org

    Using Deep Learning with Large Aggregated Datasets for COVID-19 Classification from Cough. (arXiv:2201.01669v1 [eess.AS])
    (2 min) The Covid-19 pandemic has been a scourge upon humanity, claiming the lives of more than 5 million people worldwide. Although vaccines are being distributed worldwide, there is an apparent need for affordable screening techniques to serve parts of the world that do not have access to traditional medicine. Artificial Intelligence can provide a solution utilizing cough sounds as the primary screening mode. This paper presents multiple models that have achieved relatively respectable perfor mance on the largest evaluation dataset currently presented in academic literature. Moreover, we also show that performance increases with training data size, showing the need for the world wide collection of data to help combat the Covid-19 pandemic with non-traditional means.
    Predicting trajectory behaviour via machine-learned invariant manifolds. (arXiv:2107.10154v2 [physics.chem-ph] UPDATED)
    (2 min) In this paper, we use support vector machines (SVM) to develop a machine learning framework to discover phase space structures that distinguish between distinct reaction pathways. The SVM model is trained using data from trajectories of Hamilton's equations and works well even with relatively few trajectories. Moreover, this framework is specifically designed to require minimal a priori knowledge of the dynamics in a system. This makes our approach computationally better suited than existing methods for high-dimensional systems and systems where integrating trajectories is expensive. We benchmark our approach on Chesnavich's CH$_4^+$ Hamiltonian.
    Suppression of Correlated Noise with Similarity-based Unsupervised Deep Learning. (arXiv:2011.03384v6 [cs.LG] UPDATED)
    (0 min) Image denoising is a prerequisite for downstream tasks in many fields. Low-dose and photon-counting computed tomography (CT) denoising can optimize diagnostic performance at minimized radiation dose. Supervised deep denoising methods are popular but require paired clean or noisy samples that are often unavailable in practice. Limited by the independent noise assumption, current unsupervised denoising methods cannot process correlated noises as in CT images. Here we propose the first-of-its-kind similarity-based unsupervised deep denoising approach, referred to as Noise2Sim, that works in a nonlocal and nonlinear fashion to suppress not only independent but also correlated noises. Theoretically, Noise2Sim is asymptotically equivalent to supervised learning methods under mild conditions. Experimentally, Nosie2Sim recovers intrinsic features from noisy low-dose CT and photon-counting CT images as effectively as or even better than supervised learning methods on practical datasets visually, quantitatively and statistically. Noise2Sim is a general unsupervised denoising approach and has great potential in diverse applications.
    Latte: Cross-framework Python Package for Evaluation of Latent-Based Generative Models. (arXiv:2112.10638v2 [cs.LG] UPDATED)
    (0 min) Latte (for LATent Tensor Evaluation) is a Python library for evaluation of latent-based generative models in the fields of disentanglement learning and controllable generation. Latte is compatible with both PyTorch and TensorFlow/Keras, and provides both functional and modular APIs that can be easily extended to support other deep learning frameworks. Using NumPy-based and framework-agnostic implementation, Latte ensures reproducible, consistent, and deterministic metric calculations regardless of the deep learning framework of choice.
    End-to-End Autoencoder Communications with Optimized Interference Suppression. (arXiv:2201.01388v1 [cs.IT])
    (0 min) An end-to-end communications system based on Orthogonal Frequency Division Multiplexing (OFDM) is modeled as an autoencoder (AE) for which the transmitter (coding and modulation) and receiver (demodulation and decoding) are represented as deep neural networks (DNNs) of the encoder and decoder, respectively. This AE communications approach is shown to outperform conventional communications in terms of bit error rate (BER) under practical scenarios regarding channel and interference effects as well as training data and embedded implementation constraints. A generative adversarial network (GAN) is trained to augment the training data when there is not enough training data available. Also, the performance is evaluated in terms of the DNN model quantization and the corresponding memory requirements for embedded implementation. Then, interference training and randomized smoothing are introduced to train the AE communications to operate under unknown and dynamic interference (jamming) effects on potentially multiple OFDM symbols. Relative to conventional communications, up to 36 dB interference suppression for a channel reuse of four can be achieved by the AE communications with interference training and randomized smoothing. AE communications is also extended to the multiple-input multiple-output (MIMO) case and its BER performance gain with and without interference effects is demonstrated compared to conventional MIMO communications.
    Formant Tracking Using Quasi-Closed Phase Forward-Backward Linear Prediction Analysis and Deep Neural Networks. (arXiv:2201.01525v1 [eess.AS])
    (0 min) Formant tracking is investigated in this study by using trackers based on dynamic programming (DP) and deep neural nets (DNNs). Using the DP approach, six formant estimation methods were first compared. The six methods include linear prediction (LP) algorithms, weighted LP algorithms and the recently developed quasi-closed phase forward-backward (QCP-FB) method. QCP-FB gave the best performance in the comparison. Therefore, a novel formant tracking approach, which combines benefits of deep learning and signal processing based on QCP-FB, was proposed. In this approach, the formants predicted by a DNN-based tracker from a speech frame are refined using the peaks of the all-pole spectrum computed by QCP-FB from the same frame. Results show that the proposed DNN-based tracker performed better both in detection rate and estimation error for the lowest three formants compared to reference formant trackers. Compared to the popular Wavesurfer, for example, the proposed tracker gave a reduction of 29%, 48% and 35% in the estimation error for the lowest three formants, respectively.
    A Federated Deep Learning Framework for Privacy Preservation and Communication Efficiency. (arXiv:2001.09782v3 [cs.DC] UPDATED)
    (0 min) Deep learning has achieved great success in many applications. However, its deployment in practice has been hurdled by two issues: the privacy of data that has to be aggregated centrally for model training and high communication overhead due to transmission of a large amount of data usually geographically distributed. Addressing both issues is challenging and most existing works could not provide an efficient solution. In this paper, we develop FedPC, a Federated Deep Learning Framework for Privacy Preservation and Communication Efficiency. The framework allows a model to be learned on multiple private datasets while not revealing any information of training data, even with intermediate data. The framework also minimizes the amount of data exchanged to update the model. We formally prove the convergence of the learning model when training with FedPC and its privacy-preserving property. We perform extensive experiments to evaluate the performance of FedPC in terms of the approximation to the upper-bound performance (when training centrally) and communication overhead. The results show that FedPC maintains the performance approximation of the models within $8.5\%$ of the centrally-trained models when data is distributed to 10 computing nodes. FedPC also reduces the communication overhead by up to $42.20\%$ compared to existing works.
    ROOM: Adversarial Machine Learning Attacks Under Real-Time Constraints. (arXiv:2201.01621v1 [cs.CR])
    (0 min) Advances in deep learning have enabled a wide range of promising applications. However, these systems are vulnerable to Adversarial Machine Learning (AML) attacks; adversarially crafted perturbations to their inputs could cause them to misclassify. Several state-of-the-art adversarial attacks have demonstrated that they can reliably fool classifiers making these attacks a significant threat. Adversarial attack generation algorithms focus primarily on creating successful examples while controlling the noise magnitude and distribution to make detection more difficult. The underlying assumption of these attacks is that the adversarial noise is generated offline, making their execution time a secondary consideration. However, recently, just-in-time adversarial attacks where an attacker opportunistically generates adversarial examples on the fly have been shown to be possible. This paper introduces a new problem: how do we generate adversarial noise under real-time constraints to support such real-time adversarial attacks? Understanding this problem improves our understanding of the threat these attacks pose to real-time systems and provides security evaluation benchmarks for future defenses. Therefore, we first conduct a run-time analysis of adversarial generation algorithms. Universal attacks produce a general attack offline, with no online overhead, and can be applied to any input; however, their success rate is limited because of their generality. In contrast, online algorithms, which work on a specific input, are computationally expensive, making them inappropriate for operation under time constraints. Thus, we propose ROOM, a novel Real-time Online-Offline attack construction Model where an offline component serves to warm up the online algorithm, making it possible to generate highly successful attacks under time constraints.
    Unlabeled sample compression schemes and corner peelings for ample and maximum classes. (arXiv:1812.02099v2 [cs.DM] UPDATED)
    (0 min) We examine connections between combinatorial notions that arise in machine learning and topological notions in cubical/simplicial geometry. These connections enable to export results from geometry to machine learning. Our first main result is based on a geometric construction by Tracy Hall (2004) of a partial shelling of the cross-polytope which can not be extended. We use it to derive a maximum class of VC dimension 3 that has no corners. This refutes several previous works in machine learning from the past 11 years. In particular, it implies that all previous constructions of optimal unlabeled sample compression schemes for maximum classes are erroneous. On the positive side we present a new construction of an unlabeled sample compression scheme for maximum classes. We leave as open whether our unlabeled sample compression scheme extends to ample (a.k.a. lopsided or extremal) classes, which represent a natural and far-reaching generalization of maximum classes. Towards resolving this question, we provide a geometric characterization in terms of unique sink orientations of the 1-skeletons of associated cubical complexes.
    The Effect of Model Compression on Fairness in Facial Expression Recognition. (arXiv:2201.01709v1 [cs.CV])
    (0 min) Deep neural networks have proved hugely successful, achieving human-like performance on a variety of tasks. However, they are also computationally expensive, which has motivated the development of model compression techniques which reduce the resource consumption associated with deep learning models. Nevertheless, recent studies have suggested that model compression can have an adverse effect on algorithmic fairness, amplifying existing biases in machine learning models. With this project we aim to extend those studies to the context of facial expression recognition. To do that, we set up a neural network classifier to perform facial expression recognition and implement several model compression techniques on top of it. We then run experiments on two facial expression datasets, namely the Extended Cohn-Kanade Dataset (CK+DB) and the Real-World Affective Faces Database (RAF-DB), to examine the individual and combined effect that compression techniques have on the model size, accuracy and fairness. Our experimental results show that: (i) Compression and quantisation achieve significant reduction in model size with minimal impact on overall accuracy for both CK+DB and RAF-DB; (ii) in terms of model accuracy, the classifier trained and tested on RAF-DB seems more robust to compression compared to the CK+ DB; (iii) for RAF-DB, the different compression strategies do not seem to increase the gap in predictive performance across the sensitive attributes of gender, race and age which is in contrast with the results on the CK+DB, where compression seems to amplify existing biases for gender. We analyse the results and discuss the potential reasons for our findings.
    Adversarial Feature Desensitization. (arXiv:2006.04621v3 [cs.LG] UPDATED)
    (0 min) Neural networks are known to be vulnerable to adversarial attacks -- slight but carefully constructed perturbations of the inputs which can drastically impair the network's performance. Many defense methods have been proposed for improving robustness of deep networks by training them on adversarially perturbed inputs. However, these models often remain vulnerable to new types of attacks not seen during training, and even to slightly stronger versions of previously seen attacks. In this work, we propose a novel approach to adversarial robustness, which builds upon the insights from the domain adaptation field. Our method, called Adversarial Feature Desensitization (AFD), aims at learning features that are invariant towards adversarial perturbations of the inputs. This is achieved through a game where we learn features that are both predictive and robust (insensitive to adversarial attacks), i.e. cannot be used to discriminate between natural and adversarial data. Empirical results on several benchmarks demonstrate the effectiveness of the proposed approach against a wide range of attack types and attack strengths. Our code is available at https://github.com/BashivanLab/afd.
    Towards a Multi-modal, Multi-task Learning based Pre-training Framework for Document Representation Learning. (arXiv:2009.14457v2 [cs.CL] UPDATED)
    (0 min) Recent approaches in literature have exploited the multi-modal information in documents (text, layout, image) to serve specific downstream document tasks. However, they are limited by their - (i) inability to learn cross-modal representations across text, layout and image dimensions for documents and (ii) inability to process multi-page documents. Pre-training techniques have been shown in Natural Language Processing (NLP) domain to learn generic textual representations from large unlabelled datasets, applicable to various downstream NLP tasks. In this paper, we propose a multi-task learning-based framework that utilizes a combination of self-supervised and supervised pre-training tasks to learn a generic document representation applicable to various downstream document tasks. Specifically, we introduce Document Topic Modelling and Document Shuffle Prediction as novel pre-training tasks to learn rich image representations along with the text and layout representations for documents. We utilize the Longformer network architecture as the backbone to encode the multi-modal information from multi-page documents in an end-to-end fashion. We showcase the applicability of our pre-training framework on a variety of different real-world document tasks such as document classification, document information extraction, and document retrieval. We evaluate our framework on different standard document datasets and conduct exhaustive experiments to compare performance against various ablations of our framework and state-of-the-art baselines.
    FeatureEnVi: Visual Analytics for Feature Engineering Using Stepwise Selection and Semi-Automatic Extraction Approaches. (arXiv:2103.14539v3 [cs.LG] UPDATED)
    (0 min) The machine learning (ML) life cycle involves a series of iterative steps, from the effective gathering and preparation of the data, including complex feature engineering processes, to the presentation and improvement of results, with various algorithms to choose from in every step. Feature engineering in particular can be very beneficial for ML, leading to numerous improvements such as boosting the predictive results, decreasing computational times, reducing excessive noise, and increasing the transparency behind the decisions taken during the training. Despite that, while several visual analytics tools exist to monitor and control the different stages of the ML life cycle (especially those related to data and algorithms), feature engineering support remains inadequate. In this paper, we present FeatureEnVi, a visual analytics system specifically designed to assist with the feature engineering process. Our proposed system helps users to choose the most important feature, to transform the original features into powerful alternatives, and to experiment with different feature generation combinations. Additionally, data space slicing allows users to explore the impact of features on both local and global scales. FeatureEnVi utilizes multiple automatic feature selection techniques; furthermore, it visually guides users with statistical evidence about the influence of each feature (or subsets of features). The final outcome is the extraction of heavily engineered features, evaluated by multiple validation metrics. The usefulness and applicability of FeatureEnVi are demonstrated with two use cases and a case study. We also report feedback from interviews with two ML experts and a visualization researcher who assessed the effectiveness of our system.
    The OARF Benchmark Suite: Characterization and Implications for Federated Learning Systems. (arXiv:2006.07856v3 [cs.LG] UPDATED)
    (0 min) This paper presents and characterizes an Open Application Repository for Federated Learning (OARF), a benchmark suite for federated machine learning systems. Previously available benchmarks for federated learning have focused mainly on synthetic datasets and use a limited number of applications. OARF mimics more realistic application scenarios with publicly available data sets as different data silos in image, text and structured data. Our characterization shows that the benchmark suite is diverse in data size, distribution, feature distribution and learning task complexity. The extensive evaluations with reference implementations show the future research opportunities for important aspects of federated learning systems. We have developed reference implementations, and evaluated the important aspects of federated learning, including model accuracy, communication cost, throughput and convergence time. Through these evaluations, we discovered some interesting findings such as federated learning can effectively increase end-to-end throughput.
    Latent Space based Memory Replay for Continual Learning in Artificial Neural Networks. (arXiv:2111.13297v2 [cs.LG] UPDATED)
    (0 min) Memory replay may be key to learning in biological brains, which manage to learn new tasks continually without catastrophically interfering with previous knowledge. On the other hand, artificial neural networks suffer from catastrophic forgetting and tend to only perform well on tasks that they were recently trained on. In this work we explore the application of latent space based memory replay for classification using artificial neural networks. We are able to preserve good performance in previous tasks by storing only a small percentage of the original data in a compressed latent space version.
    Matrix Completion with Hierarchical Graph Side Information. (arXiv:2201.01728v1 [stat.ML])
    (2 min) We consider a matrix completion problem that exploits social or item similarity graphs as side information. We develop a universal, parameter-free, and computationally efficient algorithm that starts with hierarchical graph clustering and then iteratively refines estimates both on graph clustering and matrix ratings. Under a hierarchical stochastic block model that well respects practically-relevant social graphs and a low-rank rating matrix model (to be detailed), we demonstrate that our algorithm achieves the information-theoretic limit on the number of observed matrix entries (i.e., optimal sample complexity) that is derived by maximum likelihood estimation together with a lower-bound impossibility result. One consequence of this result is that exploiting the hierarchical structure of social graphs yields a substantial gain in sample complexity relative to the one that simply identifies different groups without resorting to the relational structure across them. We conduct extensive experiments both on synthetic and real-world datasets to corroborate our theoretical results as well as to demonstrate significant performance improvements over other matrix completion algorithms that leverage graph side information.
    Standard Vs Uniform Binary Search and Their Variants in Learned Static Indexing: The Case of the Searching on Sorted Data Benchmarking Software Platform. (arXiv:2201.01554v1 [cs.DS])
    (2 min) The Searching on Sorted Data ({\bf SOSD}, in short) is a highly engineered software platform for benchmarking Learned Indexes, those latter being a novel and quite effective proposal of how to search in a sorted table by combining Machine Learning techniques with classic Algorithms. In such a platform and in the related benchmarking experiments, following a natural and intuitive choice, the final search stage is performed via the Standard (textbook) Binary Search procedure. However, recent studies, that do not use Machine Learning predictions, indicate that Uniform Binary Search, streamlined to avoid \vir{branching} in the main loop, is superior in performance to its Standard counterpart when the table to be searched into is relatively small, e.g., fitting in L1 or L2 cache. Analogous results hold for k-ary Search, even on large tables. One would expect an analogous behaviour within Learned Indexes. Via a set of extensive experiments, coherent with the State of the Art, we show that for Learned Indexes, and as far as the {\bf SOSD} software is concerned, the use of the Standard routine (either Binary or k-ary Search) is superior to the Uniform one, across all the internal memory levels. This fact provides a quantitative justification of the natural choice made so far. Our experiments also indicate that Uniform Binary and k-ary Search can be advantageous to use in order to save space in Learned Indexes, while granting a good performance in time. Our findings are of methodological relevance for this novel and fast-growing area and informative to practitioners interested in using Learned Indexes in application domains, e.g., Data Bases and Search Engines.
    Fast Gradient Non-sign Methods. (arXiv:2110.12734v2 [cs.CV] UPDATED)
    (2 min) Adversarial attacks make their success in "fooling" DNNs and among them, gradient-based algorithms become one of the mainstreams. Based on the linearity hypothesis [12], under $\ell_\infty$ constraint, $sign$ operation applied to the gradients is a good choice for generating perturbations. However, the side-effect from such operation exists since it leads to the bias of direction between the real gradients and the perturbations. In other words, current methods contain a gap between real gradients and actual noises, which leads to biased and inefficient attacks. Therefore in this paper, based on the Taylor expansion, the bias is analyzed theoretically and the correction of $\sign$, i.e., Fast Gradient Non-sign Method (FGNM), is further proposed. Notably, FGNM is a general routine, which can seamlessly replace the conventional $sign$ operation in gradient-based attacks with negligible extra computational cost. Extensive experiments demonstrate the effectiveness of our methods. Specifically, ours outperform them by \textbf{27.5\%} at most and \textbf{9.5\%} on average. Our anonymous code is publicly available: \url{https://git.io/mm-fgnm}.
    Sparse Super-Regular Networks. (arXiv:2201.01363v1 [cs.LG])
    (2 min) It has been argued by Thom and Palm that sparsely-connected neural networks (SCNs) show improved performance over fully-connected networks (FCNs). Super-regular networks (SRNs) are neural networks composed of a set of stacked sparse layers of (epsilon, delta)-super-regular pairs, and randomly permuted node order. Using the Blow-up Lemma, we prove that as a result of the individual super-regularity of each pair of layers, SRNs guarantee a number of properties that make them suitable replacements for FCNs for many tasks. These guarantees include edge uniformity across all large-enough subsets, minimum node in- and out-degree, input-output sensitivity, and the ability to embed pre-trained constructs. Indeed, SRNs have the capacity to act like FCNs, and eliminate the need for costly regularization schemes like Dropout. We show that SRNs perform similarly to X-Nets via readily reproducible experiments, and offer far greater guarantees and control over network structure.
    Deep Learning-Based Sparse Whole-Slide Image Analysis for the Diagnosis of Gastric Intestinal Metaplasia. (arXiv:2201.01449v1 [eess.IV])
    (2 min) In recent years, deep learning has successfully been applied to automate a wide variety of tasks in diagnostic histopathology. However, fast and reliable localization of small-scale regions-of-interest (ROI) has remained a key challenge, as discriminative morphologic features often occupy only a small fraction of a gigapixel-scale whole-slide image (WSI). In this paper, we propose a sparse WSI analysis method for the rapid identification of high-power ROI for WSI-level classification. We develop an evaluation framework inspired by the early classification literature, in order to quantify the tradeoff between diagnostic performance and inference time for sparse analytic approaches. We test our method on a common but time-consuming task in pathology - that of diagnosing gastric intestinal metaplasia (GIM) on hematoxylin and eosin (H&E)-stained slides from endoscopic biopsy specimens. GIM is a well-known precursor lesion along the pathway to development of gastric cancer. We performed a thorough evaluation of the performance and inference time of our approach on a test set of GIM-positive and GIM-negative WSI, finding that our method successfully detects GIM in all positive WSI, with a WSI-level classification area under the receiver operating characteristic curve (AUC) of 0.98 and an average precision (AP) of 0.95. Furthermore, we show that our method can attain these metrics in under one minute on a standard CPU. Our results are applicable toward the goal of developing neural networks that can easily be deployed in clinical settings to support pathologists in quickly localizing and diagnosing small-scale morphologic features in WSI.
    Uncertainty Baselines: Benchmarks for Uncertainty & Robustness in Deep Learning. (arXiv:2106.04015v3 [cs.LG] UPDATED)
    (2 min) High-quality estimates of uncertainty and robustness are crucial for numerous real-world applications, especially for deep learning which underlies many deployed ML systems. The ability to compare techniques for improving these estimates is therefore very important for research and practice alike. Yet, competitive comparisons of methods are often lacking due to a range of reasons, including: compute availability for extensive tuning, incorporation of sufficiently many baselines, and concrete documentation for reproducibility. In this paper we introduce Uncertainty Baselines: high-quality implementations of standard and state-of-the-art deep learning methods on a variety of tasks. As of this writing, the collection spans 19 methods across 9 tasks, each with at least 5 metrics. Each baseline is a self-contained experiment pipeline with easily reusable and extendable components. Our goal is to provide immediate starting points for experimentation with new methods or applications. Additionally we provide model checkpoints, experiment outputs as Python notebooks, and leaderboards for comparing results. Code available at https://github.com/google/uncertainty-baselines.
    Algorithmic Stability and Generalization of an Unsupervised Feature Selection Algorithm. (arXiv:2010.09416v2 [cs.LG] UPDATED)
    (2 min) Feature selection, as a vital dimension reduction technique, reduces data dimension by identifying an essential subset of input features, which can facilitate interpretable insights into learning and inference processes. Algorithmic stability is a key characteristic of an algorithm regarding its sensitivity to perturbations of input samples. In this paper, we propose an innovative unsupervised feature selection algorithm attaining this stability with provable guarantees. The architecture of our algorithm consists of a feature scorer and a feature selector. The scorer trains a neural network (NN) to globally score all the features, and the selector adopts a dependent sub-NN to locally evaluate the representation abilities for selecting features. Further, we present algorithmic stability analysis and show that our algorithm has a performance guarantee via a generalization error bound. Extensive experimental results on real-world datasets demonstrate superior generalization performance of our proposed algorithm to strong baseline methods. Also, the properties revealed by our theoretical analysis and the stability of our algorithm-selected features are empirically confirmed.
    Modeling Human Driver Interactions Using an Infinite Policy Space Through Gaussian Processes. (arXiv:2201.01733v1 [cs.LG])
    (2 min) This paper proposes a method for modeling human driver interactions that relies on multi-output gaussian processes. The proposed method is developed as a refinement of the game theoretical hierarchical reasoning approach called "level-k reasoning" which conventionally assigns discrete levels of behaviors to agents. Although it is shown to be an effective modeling tool, the level-k reasoning approach may pose undesired constraints for predicting human decision making due to a limited number (usually 2 or 3) of driver policies it extracts. The proposed approach is put forward to fill this gap in the literature by introducing a continuous domain framework that enables an infinite policy space. By using the approach presented in this paper, more accurate driver models can be obtained, which can then be employed for creating high fidelity simulation platforms for the validation of autonomous vehicle control algorithms. The proposed method is validated on a real traffic dataset and compared with the conventional level-k approach to demonstrate its contributions and implications.
    Robust Self-Supervised Audio-Visual Speech Recognition. (arXiv:2201.01763v1 [cs.SD])
    (2 min) Audio-based automatic speech recognition (ASR) degrades significantly in noisy environments and is particularly vulnerable to interfering speech, as the model cannot determine which speaker to transcribe. Audio-visual speech recognition (AVSR) systems improve robustness by complementing the audio stream with the visual information that is invariant to noise and helps the model focus on the desired speaker. However, previous AVSR work focused solely on the supervised learning setup; hence the progress was hindered by the amount of labeled data available. In this work, we present a self-supervised AVSR framework built upon Audio-Visual HuBERT (AV-HuBERT), a state-of-the-art audio-visual speech representation learning model. On the largest available AVSR benchmark dataset LRS3, our approach outperforms prior state-of-the-art by ~50% (28.0% vs. 14.1%) using less than 10% of labeled data (433hr vs. 30hr) in the presence of babble noise, while reducing the WER of an audio-based model by over 75% (25.8% vs. 5.8%) on average.
    Aspis: A Robust Detection System for Distributed Learning. (arXiv:2108.02416v2 [cs.LG] UPDATED)
    (0 min) State-of-the-art machine learning models are routinely trained on large-scale distributed clusters. Crucially, such systems can be compromised when some of the computing devices exhibit abnormal (Byzantine) behavior and return arbitrary results to the parameter server (PS). This behavior may be attributed to a plethora of reasons, including system failures and orchestrated attacks. Existing work suggests robust aggregation and/or computational redundancy to alleviate the effect of distorted gradients. However, most of these schemes are ineffective when an adversary knows the task assignment and can choose the attacked workers judiciously to induce maximal damage. Our proposed method Aspis assigns gradient computations to worker nodes using a subset-based assignment which allows for multiple consistency checks on the behavior of a worker node. Examination of the calculated gradients and post-processing (clique-finding in an appropriately constructed graph) by the central node allows for efficient detection and subsequent exclusion of adversaries from the training process. We prove the Byzantine resilience and detection guarantees of Aspis under weak and strong attacks and extensively evaluate the system on various large-scale training scenarios. The principal metric for our experiments is the test accuracy, for which we demonstrate a significant improvement of about 30% compared to many state-of-the-art approaches on the CIFAR-10 dataset. The corresponding reduction of the fraction of corrupted gradients ranges from 16% to 99%.
    Morphology Decoder: A Machine Learning Guided 3D Vision Quantifying Heterogenous Rock Permeability for Planetary Surveillance and Robotic Functions. (arXiv:2111.13460v2 [cs.CV] UPDATED)
    (0 min) Permeability has a dominant influence on the flow properties of a natural fluid. Lattice Boltzmann simulator determines permeability from the nano and micropore network. The simulator holds millions of flow dynamics calculations with its accumulated errors and high consumption of computing power. To efficiently and consistently predict permeability, we propose a morphology decoder, a parallel and serial flow reconstruction of machine learning segmented heterogeneous Cretaceous texture from 3D micro computerized tomography and nuclear magnetic resonance images. For 3D vision, we introduce controllable-measurable-volume as new supervised segmentation, in which a unique set of voxel intensity corresponds to grain and pore throat sizes. The morphology decoder demarks and aggregates the morphologies boundaries in a novel way to produce permeability. Morphology decoder method consists of five novel processes, which describes in this paper, these novel processes are: (1) Geometrical 3D Permeability, (2) Machine Learning guided 3D Properties Recognition of Rock Morphology, (3) 3D Image Properties Integration Model for Permeability, (4) MRI Permeability Imager, and (5) Morphology Decoder (the process that integrates the other four novel processes).
    Conditional Imitation Learning for Multi-Agent Games. (arXiv:2201.01448v1 [cs.LG])
    (0 min) While advances in multi-agent learning have enabled the training of increasingly complex agents, most existing techniques produce a final policy that is not designed to adapt to a new partner's strategy. However, we would like our AI agents to adjust their strategy based on the strategies of those around them. In this work, we study the problem of conditional multi-agent imitation learning, where we have access to joint trajectory demonstrations at training time, and we must interact with and adapt to new partners at test time. This setting is challenging because we must infer a new partner's strategy and adapt our policy to that strategy, all without knowledge of the environment reward or dynamics. We formalize this problem of conditional multi-agent imitation learning, and propose a novel approach to address the difficulties of scalability and data scarcity. Our key insight is that variations across partners in multi-agent games are often highly structured, and can be represented via a low-rank subspace. Leveraging tools from tensor decomposition, our model learns a low-rank subspace over ego and partner agent strategies, then infers and adapts to a new partner strategy by interpolating in the subspace. We experiments with a mix of collaborative tasks, including bandits, particle, and Hanabi environments. Additionally, we test our conditional policies against real human partners in a user study on the Overcooked game. Our model adapts better to new partners compared to baselines, and robustly handles diverse settings ranging from discrete/continuous actions and static/online evaluation with AI/human partners.
    The Neural Coding Framework for Learning Generative Models. (arXiv:2012.03405v4 [cs.LG] UPDATED)
    (0 min) Neural generative models can be used to learn complex probability distributions from data, to sample from them, and to produce probability density estimates. We propose a computational framework for developing neural generative models inspired by the theory of predictive processing in the brain. According to predictive processing theory, the neurons in the brain form a hierarchy in which neurons in one level form expectations about sensory inputs from another level. These neurons update their local models based on differences between their expectations and the observed signals. In a similar way, artificial neurons in our generative models predict what neighboring neurons will do, and adjust their parameters based on how well the predictions matched reality. In this work, we show that the neural generative models learned within our framework perform well in practice across several benchmark datasets and metrics and either remain competitive with or significantly outperform other generative models with similar functionality (such as the variational auto-encoder).
    Challenges of Artificial Intelligence -- From Machine Learning and Computer Vision to Emotional Intelligence. (arXiv:2201.01466v1 [cs.AI])
    (0 min) Artificial intelligence (AI) has become a part of everyday conversation and our lives. It is considered as the new electricity that is revolutionizing the world. AI is heavily invested in both industry and academy. However, there is also a lot of hype in the current AI debate. AI based on so-called deep learning has achieved impressive results in many problems, but its limits are already visible. AI has been under research since the 1940s, and the industry has seen many ups and downs due to over-expectations and related disappointments that have followed. The purpose of this book is to give a realistic picture of AI, its history, its potential and limitations. We believe that AI is a helper, not a ruler of humans. We begin by describing what AI is and how it has evolved over the decades. After fundamentals, we explain the importance of massive data for the current mainstream of artificial intelligence. The most common representations for AI, methods, and machine learning are covered. In addition, the main application areas are introduced. Computer vision has been central to the development of AI. The book provides a general introduction to computer vision, and includes an exposure to the results and applications of our own research. Emotions are central to human intelligence, but little use has been made in AI. We present the basics of emotional intelligence and our own research on the topic. We discuss super-intelligence that transcends human understanding, explaining why such achievement seems impossible on the basis of present knowledge,and how AI could be improved. Finally, a summary is made of the current state of AI and what to do in the future. In the appendix, we look at the development of AI education, especially from the perspective of contents at our own university.
    Supervised Permutation Invariant Networks for Solving the CVRP with Bounded Fleet Size. (arXiv:2201.01529v1 [cs.LG])
    (0 min) Learning to solve combinatorial optimization problems, such as the vehicle routing problem, offers great computational advantages over classical operations research solvers and heuristics. The recently developed deep reinforcement learning approaches either improve an initially given solution iteratively or sequentially construct a set of individual tours. However, most of the existing learning-based approaches are not able to work for a fixed number of vehicles and thus bypass the complex assignment problem of the customers onto an apriori given number of available vehicles. On the other hand, this makes them less suitable for real applications, as many logistic service providers rely on solutions provided for a specific bounded fleet size and cannot accommodate short term changes to the number of vehicles. In contrast we propose a powerful supervised deep learning framework that constructs a complete tour plan from scratch while respecting an apriori fixed number of available vehicles. In combination with an efficient post-processing scheme, our supervised approach is not only much faster and easier to train but also achieves competitive results that incorporate the practical aspect of vehicle costs. In thorough controlled experiments we compare our method to multiple state-of-the-art approaches where we demonstrate stable performance, while utilizing less vehicles and shed some light on existent inconsistencies in the experimentation protocols of the related work.
    Relationship extraction for knowledge graph creation from biomedical literature. (arXiv:2201.01647v1 [cs.AI])
    (0 min) Biomedical research is growing in such an exponential pace that scientists, researchers and practitioners are no more able to cope with the amount of published literature in the domain. The knowledge presented in the literature needs to be systematized in such a ways that claims and hypothesis can be easily found, accessed and validated. Knowledge graphs can provide such framework for semantic knowledge representation from literature. However, in order to build knowledge graph, it is necessary to extract knowledge in form of relationships between biomedical entities and normalize both entities and relationship types. In this paper, we present and compare few rule-based and machine learning-based (Naive Bayes, Random Forests as examples of traditional machine learning methods and T5-based model as an example of modern deep learning) methods for scalable relationship extraction from biomedical literature for the integration into the knowledge graphs. We examine how resilient are these various methods to unbalanced and fairly small datasets, showing that T5 model handles well both small datasets, due to its pre-training on large C4 dataset as well as unbalanced data. The best performing model was T5 model fine-tuned on balanced data, with reported F1-score of 0.88.
    Stein Latent Optimization for Generative Adversarial Networks. (arXiv:2106.05319v4 [cs.LG] UPDATED)
    (0 min) Generative adversarial networks (GANs) with clustered latent spaces can perform conditional generation in a completely unsupervised manner. In the real world, the salient attributes of unlabeled data can be imbalanced. However, most of existing unsupervised conditional GANs cannot cluster attributes of these data in their latent spaces properly because they assume uniform distributions of the attributes. To address this problem, we theoretically derive Stein latent optimization that provides reparameterizable gradient estimations of the latent distribution parameters assuming a Gaussian mixture prior in a continuous latent space. Structurally, we introduce an encoder network and novel unsupervised conditional contrastive loss to ensure that data generated from a single mixture component represent a single attribute. We confirm that the proposed method, named Stein Latent Optimization for GANs (SLOGAN), successfully learns balanced or imbalanced attributes and achieves state-of-the-art unsupervised conditional generation performance even in the absence of attribute information (e.g., the imbalance ratio). Moreover, we demonstrate that the attributes to be learned can be manipulated using a small amount of probe data.
    Inferring dark matter substructure with astrometric lensing beyond the power spectrum. (arXiv:2110.01620v2 [astro-ph.CO] UPDATED)
    (0 min) Astrometry -- the precise measurement of positions and motions of celestial objects -- has emerged as a promising avenue for characterizing the dark matter population in our Galaxy. By leveraging recent advances in simulation-based inference and neural network architectures, we introduce a novel method to search for global dark matter-induced gravitational lensing signatures in astrometric datasets. Our method based on neural likelihood-ratio estimation shows significantly enhanced sensitivity to a cold dark matter population and more favorable scaling with measurement noise compared to existing approaches based on two-point correlation statistics. We demonstrate the real-world viability of our method by showing it to be robust to non-trivial modeled as well as unmodeled noise features expected in astrometric measurements. This establishes machine learning as a powerful tool for characterizing dark matter using astrometric data.
    Probing TryOnGAN. (arXiv:2201.01703v1 [cs.CV])
    (0 min) TryOnGAN is a recent virtual try-on approach, which generates highly realistic images and outperforms most previous approaches. In this article, we reproduce the TryOnGAN implementation and probe it along diverse angles: impact of transfer learning, variants of conditioning image generation with poses and properties of latent space interpolation. Some of these facets have never been explored in literature earlier. We find that transfer helps training initially but gains are lost as models train longer and pose conditioning via concatenation performs better. The latent space self-disentangles the pose and the style features and enables style transfer across poses. Our code and models are available in open source.
    exploRNN: Understanding Recurrent Neural Networks through Visual Exploration. (arXiv:2012.06326v2 [cs.LG] UPDATED)
    (0 min) Due to the success of deep learning (DL) and its growing job market, students and researchers from many areas are interested in learning about DL technologies. Visualization has proven to be of great help during this learning process. While most current educational visualizations are targeted towards one specific architecture or use case, recurrent neural networks (RNNs), which are capable of processing sequential data, are not covered yet. This is despite the fact that tasks on sequential data, such as text and function analysis, are at the forefront of DL research. Therefore, we propose exploRNN, the first interactively explorable educational visualization for RNNs. On the basis of making learning easier and more fun, we define educational objectives targeted towards understanding RNNs. We use these objectives to form guidelines for the visual design process. By means of exploRNN, which is accessible online, we provide an overview of the training process of RNNs at a coarse level, while also allowing a detailed inspection of the data flow within LSTM cells. In an empirical study, we assessed 37 subjects in a between-subjects design to investigate the learning outcomes and cognitive load of exploRNN compared to a classic text-based learning environment. While learners in the text group are ahead in superficial knowledge acquisition, exploRNN is particularly helpful for deeper understanding of the learning content. In addition, the complex content in exploRNN is perceived as significantly easier and causes less extraneous load than in the text group. The study shows that for difficult learning material such as recurrent networks, where deep understanding is important, interactive visualizations such as exploRNN can be helpful.
    How Does Knowledge Graph Embedding Extrapolate to Unseen Data: a Semantic Evidence View. (arXiv:2109.11800v2 [cs.CL] UPDATED)
    (0 min) Knowledge Graph Embedding (KGE) aims to learn representations for entities and relations. Most KGE models have gained great success, especially on extrapolation scenarios. Specifically, given an unseen triple (h, r, t), a trained model can still correctly predict t from (h, r, ?), or h from (?, r, t), such extrapolation ability is impressive. However, most existing KGE works focus on the design of delicate triple modeling function, which mainly tells us how to measure the plausibility of observed triples, but offers limited explanation of why the methods can extrapolate to unseen data, and what are the important factors to help KGE extrapolate. Therefore in this work, we attempt to study the KGE extrapolation of two problems: 1. How does KGE extrapolate to unseen data? 2. How to design the KGE model with better extrapolation ability? For the problem 1, we first discuss the impact factors for extrapolation and from relation, entity and triple level respectively, propose three Semantic Evidences (SEs), which can be observed from train set and provide important semantic information for extrapolation. Then we verify the effectiveness of SEs through extensive experiments on several typical KGE methods. For the problem 2, to make better use of the three levels of SE, we propose a novel GNN-based KGE model, called Semantic Evidence aware Graph Neural Network (SE-GNN). In SE-GNN, each level of SE is modeled explicitly by the corresponding neighbor pattern, and merged sufficiently by the multi-layer aggregation, which contributes to obtaining more extrapolative knowledge representation. Finally, through extensive experiments on FB15k-237 and WN18RR datasets, we show that SE-GNN achieves state-of-the-art performance on Knowledge Graph Completion task and performs a better extrapolation ability.
    Approximate Spectral Decomposition of Fisher Information Matrix for Simple ReLU Networks. (arXiv:2111.15256v2 [cs.LG] UPDATED)
    (0 min) We argue the Fisher information matrix (FIM) of one hidden layer networks with the ReLU activation function. Let $W$ denote the $d \times p$ weight matrix from the $d$-dimensional input to the hidden layer consisting of $p$ neurons, and $v$ the $p$-dimensional weight vector from the hidden layer to the scalar output. We focus on the FIM of $v$, which we denote as $I$. When $p$ is large, under certain conditions, the following approximately holds. 1) There are three major clusters in the eigenvalue distribution. 2) Since $I$ is non-negative owing to the ReLU, the first eigenvalue is the Perron-Frobenius eigenvalue. 3) For the cluster of the next maximum values, the eigenspace is spanned by the row vectors of $W$. 4) The direct sum of the eigenspace of the first eigenvalue and that of the third cluster is spanned by the set of all the vectors obtained as the Hadamard product of any pair of the row vectors of $W$. We confirmed by numerical simulation that the above is approximately correct when the number of hidden nodes is about 10000.
    Debiased Learning from Naturally Imbalanced Pseudo-Labels for Zero-Shot and Semi-Supervised Learning. (arXiv:2201.01490v1 [cs.LG])
    (0 min) This work studies the bias issue of pseudo-labeling, a natural phenomenon that widely occurs but often overlooked by prior research. Pseudo-labels are generated when a classifier trained on source data is transferred to unlabeled target data. We observe heavy long-tailed pseudo-labels when a semi-supervised learning model FixMatch predicts labels on the unlabeled set even though the unlabeled data is curated to be balanced. Without intervention, the training model inherits the bias from the pseudo-labels and end up being sub-optimal. To eliminate the model bias, we propose a simple yet effective method DebiasMatch, comprising of an adaptive debiasing module and an adaptive marginal loss. The strength of debiasing and the size of margins can be automatically adjusted by making use of an online updated queue. Benchmarked on ImageNet-1K, DebiasMatch significantly outperforms previous state-of-the-arts by more than 26% and 8.7% on semi-supervised learning (0.2% annotated data) and zero-shot learning tasks respectively.
    Fractal Autoencoders for Feature Selection. (arXiv:2010.09430v2 [cs.LG] UPDATED)
    (0 min) Feature selection reduces the dimensionality of data by identifying a subset of the most informative features. In this paper, we propose an innovative framework for unsupervised feature selection, called fractal autoencoders (FAE). It trains a neural network to pinpoint informative features for global exploring of representability and for local excavating of diversity. Architecturally, FAE extends autoencoders by adding a one-to-one scoring layer and a small sub-neural network for feature selection in an unsupervised fashion. With such a concise architecture, FAE achieves state-of-the-art performances; extensive experimental results on fourteen datasets, including very high-dimensional data, have demonstrated the superiority of FAE over existing contemporary methods for unsupervised feature selection. In particular, FAE exhibits substantial advantages on gene expression data exploration, reducing measurement cost by about $15$\% over the widely used L1000 landmark genes. Further, we show that the FAE framework is easily extensible with an application.
    Dynamic GPU Energy Optimization for Machine Learning Training Workloads. (arXiv:2201.01684v1 [cs.DC])
    (0 min) GPUs are widely used to accelerate the training of machine learning workloads. As modern machine learning models become increasingly larger, they require a longer time to train, leading to higher GPU energy consumption. This paper presents GPOEO, an online GPU energy optimization framework for machine learning training workloads. GPOEO dynamically determines the optimal energy configuration by employing novel techniques for online measurement, multi-objective prediction modeling, and search optimization. To characterize the target workload behavior, GPOEO utilizes GPU performance counters. To reduce the performance counter profiling overhead, it uses an analytical model to detect the training iteration change and only collects performance counter data when an iteration shift is detected. GPOEO employs multi-objective models based on gradient boosting and a local search algorithm to find a trade-off between execution time and energy consumption. We evaluate the GPOEO by applying it to 71 machine learning workloads from two AI benchmark suites running on an NVIDIA RTX3080Ti GPU. Compared with the NVIDIA default scheduling strategy, GPOEO delivers a mean energy saving of 16.2% with a modest average execution time increase of 5.1%.
    Optimal Transport using GANs for Lineage Tracing. (arXiv:2007.12098v3 [cs.LG] UPDATED)
    (2 min) In this paper, we present Super-OT, a novel approach to computational lineage tracing that combines a supervised learning framework with optimal transport based on Generative Adversarial Networks (GANs). Unlike previous approaches to lineage tracing, Super-OT has the flexibility to integrate paired data. We benchmark Super-OT based on single-cell RNA-seq data against Waddington-OT, a popular approach for lineage tracing that also employs optimal transport. We show that Super-OT achieves gains over Waddington-OT in predicting the class outcome of cells during differentiation, since it allows the integration of additional information during training.
    Sample Efficient Deep Reinforcement Learning via Uncertainty Estimation. (arXiv:2201.01666v1 [cs.LG])
    (2 min) In model-free deep reinforcement learning (RL) algorithms, using noisy value estimates to supervise policy evaluation and optimization is detrimental to the sample efficiency. As this noise is heteroscedastic, its effects can be mitigated using uncertainty-based weights in the optimization process. Previous methods rely on sampled ensembles, which do not capture all aspects of uncertainty. We provide a systematic analysis of the sources of uncertainty in the noisy supervision that occurs in RL, and introduce inverse-variance RL, a Bayesian framework which combines probabilistic ensembles and Batch Inverse Variance weighting. We propose a method whereby two complementary uncertainty estimation methods account for both the Q-value and the environment stochasticity to better mitigate the negative impacts of noisy supervision. Our results show significant improvement in terms of sample efficiency on discrete and continuous control tasks.
    Asymptotics of $\ell_2$ Regularized Network Embeddings. (arXiv:2201.01689v1 [stat.ML])
    (2 min) A common approach to solving tasks, such as node classification or link prediction, on a large network begins by learning a Euclidean embedding of the nodes of the network, from which regular machine learning methods can be applied. For unsupervised random walk methods such as DeepWalk and node2vec, adding a $\ell_2$ penalty on the embedding vectors to the loss leads to improved downstream task performance. In this paper we study the effects of this regularization and prove that, under exchangeability assumptions on the graph, it asymptotically leads to learning a nuclear-norm-type penalized graphon. In particular, the exact form of the penalty depends on the choice of subsampling method used within stochastic gradient descent to learn the embeddings. We also illustrate empirically that concatenating node covariates to $\ell_2$ regularized node2vec embeddings leads to comparable, if not superior, performance to methods which incorporate node covariates and the network structure in a non-linear manner.
    Approximation capabilities of measure-preserving neural networks. (arXiv:2106.10911v2 [cs.LG] UPDATED)
    (2 min) Measure-preserving neural networks are well-developed invertible models, however, their approximation capabilities remain unexplored. This paper rigorously analyses the approximation capabilities of existing measure-preserving neural networks including NICE and RevNets. It is shown that for compact $U \subset \R^D$ with $D\geq 2$, the measure-preserving neural networks are able to approximate arbitrary measure-preserving map $\psi: U\to \R^D$ which is bounded and injective in the $L^p$-norm. In particular, any continuously differentiable injective map with $\pm 1$ determinant of Jacobian are measure-preserving, thus can be approximated.
    Interpretable machine learning for high-dimensional trajectories of aging health. (arXiv:2105.03410v2 [q-bio.QM] UPDATED)
    (2 min) We have built a computational model for individual aging trajectories of health and survival, which contains physical, functional, and biological variables, and is conditioned on demographic, lifestyle, and medical background information. We combine techniques of modern machine learning with an interpretable interaction network, where health variables are coupled by explicit pair-wise interactions within a stochastic dynamical system. Our dynamic joint interpretable network (DJIN) model is scalable to large longitudinal data sets, is predictive of individual high-dimensional health trajectories and survival from baseline health states, and infers an interpretable network of directed interactions between the health variables. The network identifies plausible physiological connections between health variables as well as clusters of strongly connected health variables. We use English Longitudinal Study of Aging (ELSA) data to train our model and show that it performs better than multiple dedicated linear models for health outcomes and survival. We compare our model with flexible lower-dimensional latent-space models to explore the dimensionality required to accurately model aging health outcomes. Our DJIN model can be used to generate synthetic individuals that age realistically, to impute missing data, and to simulate future aging outcomes given arbitrary initial health states.
    Deep Reinforcement Learning at the Edge of the Statistical Precipice. (arXiv:2108.13264v4 [cs.LG] UPDATED)
    (3 min) Deep reinforcement learning (RL) algorithms are predominantly evaluated by comparing their relative performance on a large suite of tasks. Most published results on deep RL benchmarks compare point estimates of aggregate performance such as mean and median scores across tasks, ignoring the statistical uncertainty implied by the use of a finite number of training runs. Beginning with the Arcade Learning Environment (ALE), the shift towards computationally-demanding benchmarks has led to the practice of evaluating only a small number of runs per task, exacerbating the statistical uncertainty in point estimates. In this paper, we argue that reliable evaluation in the few run deep RL regime cannot ignore the uncertainty in results without running the risk of slowing down progress in the field. We illustrate this point using a case study on the Atari 100k benchmark, where we find substantial discrepancies between conclusions drawn from point estimates alone versus a more thorough statistical analysis. With the aim of increasing the field's confidence in reported results with a handful of runs, we advocate for reporting interval estimates of aggregate performance and propose performance profiles to account for the variability in results, as well as present more robust and efficient aggregate metrics, such as interquartile mean scores, to achieve small uncertainty in results. Using such statistical tools, we scrutinize performance evaluations of existing algorithms on other widely used RL benchmarks including the ALE, Procgen, and the DeepMind Control Suite, again revealing discrepancies in prior comparisons. Our findings call for a change in how we evaluate performance in deep RL, for which we present a more rigorous evaluation methodology, accompanied with an open-source library rliable, to prevent unreliable results from stagnating the field.
    Constrained Optimization to Train Neural Networks on Critical and Under-Represented Classes. (arXiv:2102.12894v4 [cs.LG] UPDATED)
    (3 min) Deep neural networks (DNNs) are notorious for making more mistakes for the classes that have substantially fewer samples than the others during training. Such class imbalance is ubiquitous in clinical applications and very crucial to handle because the classes with fewer samples most often correspond to critical cases (e.g., cancer) where misclassifications can have severe consequences. Not to miss such cases, binary classifiers need to be operated at high True Positive Rates (TPRs) by setting a higher threshold, but this comes at the cost of very high False Positive Rates (FPRs) for problems with class imbalance. Existing methods for learning under class imbalance most often do not take this into account. We argue that prediction accuracy should be improved by emphasizing reducing FPRs at high TPRs for problems where misclassification of the positive, i.e. critical, class samples are associated with higher cost. To this end, we pose the training of a DNN for binary classification as a constrained optimization problem and introduce a novel constraint that can be used with existing loss functions to enforce maximal area under the ROC curve (AUC) through prioritizing FPR reduction at high TPR. We solve the resulting constrained optimization problem using an Augmented Lagrangian method (ALM). Going beyond binary, we also propose two possible extensions of the proposed constraint for multi-class classification problems. We present experimental results for image-based binary and multi-class classification applications using an in-house medical imaging dataset, CIFAR10, and CIFAR100. Our results demonstrate that the proposed method improves the baselines in majority of the cases by attaining higher accuracy on critical classes while reducing the misclassification rate for the non-critical class samples.
    Astraea: Grammar-based Fairness Testing. (arXiv:2010.02542v4 [cs.SE] UPDATED)
    (2 min) Software often produces biased outputs. In particular, machine learning (ML) based software are known to produce erroneous predictions when processing discriminatory inputs. Such unfair program behavior can be caused by societal bias. In the last few years, Amazon, Microsoft and Google have provided software services that produce unfair outputs, mostly due to societal bias (e.g. gender or race). In such events, developers are saddled with the task of conducting fairness testing. Fairness testing is challenging; developers are tasked with generating discriminatory inputs that reveal and explain biases. We propose a grammar-based fairness testing approach (called ASTRAEA) which leverages context-free grammars to generate discriminatory inputs that reveal fairness violations in software systems. Using probabilistic grammars, ASTRAEA also provides fault diagnosis by isolating the cause of observed software bias. ASTRAEA's diagnoses facilitate the improvement of ML fairness. ASTRAEA was evaluated on 18 software systems that provide three major natural language processing (NLP) services. In our evaluation, ASTRAEA generated fairness violations with a rate of ~18%. ASTRAEA generated over 573K discriminatory test cases and found over 102K fairness violations. Furthermore, ASTRAEA improves software fairness by ~76%, via model-retraining.
    Exemplar-free Class Incremental Learning via Discriminative and Comparable One-class Classifiers. (arXiv:2201.01488v1 [cs.LG])
    (2 min) The exemplar-free class incremental learning requires classification models to learn new class knowledge incrementally without retaining any old samples. Recently, the framework based on parallel one-class classifiers (POC), which trains a one-class classifier (OCC) independently for each category, has attracted extensive attention, since it can naturally avoid catastrophic forgetting. POC, however, suffers from weak discriminability and comparability due to its independent training strategy for different OOCs. To meet this challenge, we propose a new framework, named Discriminative and Comparable One-class classifiers for Incremental Learning (DisCOIL). DisCOIL follows the basic principle of POC, but it adopts variational auto-encoders (VAE) instead of other well-established one-class classifiers (e.g. deep SVDD), because a trained VAE can not only identify the probability of an input sample belonging to a class but also generate pseudo samples of the class to assist in learning new tasks. With this advantage, DisCOIL trains a new-class VAE in contrast with the old-class VAEs, which forces the new-class VAE to reconstruct better for new-class samples but worse for the old-class pseudo samples, thus enhancing the comparability. Furthermore, DisCOIL introduces a hinge reconstruction loss to ensure the discriminability. We evaluate our method extensively on MNIST, CIFAR10, and Tiny-ImageNet. The experimental results show that DisCOIL achieves state-of-the-art performance.
    Reproducible and Portable Big Data Analytics in the Cloud. (arXiv:2112.09762v1 [cs.DC] CROSS LISTED)
    (2 min) Cloud computing has become a major approach to enable reproducible computational experiments because of its support of on-demand hardware and software resource provisioning. Yet there are still two main difficulties in reproducing big data applications in the cloud. The first is how to automate end-to-end execution of big data analytics in the cloud including virtual distributed environment provisioning, network and security group setup, and big data analytics pipeline description and execution. The second is an application developed for one cloud, such as AWS or Azure, is difficult to reproduce in another cloud, a.k.a. vendor lock-in problem. To tackle these problems, we leverage serverless computing and containerization techniques for automatic scalable big data application execution and reproducibility, and utilize the adapter design pattern to enable application portability and reproducibility across different clouds. Based on the approach, we propose and develop an open-source toolkit that supports 1) on-demand distributed hardware and software environment provisioning, 2) automatic data and configuration storage for each execution, 3) flexible client modes based on user preferences, 4) execution history query, and 5) simple reproducibility of existing executions in the same environment or a different environment. We did extensive experiments on both AWS and Azure using three big data analytics applications that run on a virtual CPU/GPU cluster. Three main behaviors of our toolkit were benchmarked: i) execution overhead ratio for reproducibility support, ii) differences of reproducing the same application on AWS and Azure in terms of execution time, budgetary cost and cost-performance ratio, iii) differences between scale-out and scale-up approach for the same application on AWS and Azure.
    Towards Similarity-Aware Time-Series Classification. (arXiv:2201.01413v1 [cs.LG])
    (0 min) We study time-series classification (TSC), a fundamental task of time-series data mining. Prior work has approached TSC from two major directions: (1) similarity-based methods that classify time-series based on the nearest neighbors, and (2) deep learning models that directly learn the representations for classification in a data-driven manner. Motivated by the different working mechanisms within these two research lines, we aim to connect them in such a way as to jointly model time-series similarities and learn the representations. This is a challenging task because it is unclear how we should efficiently leverage similarity information. To tackle the challenge, we propose Similarity-Aware Time-Series Classification (SimTSC), a conceptually simple and general framework that models similarity information with graph neural networks (GNNs). Specifically, we formulate TSC as a node classification problem in graphs, where the nodes correspond to time-series, and the links correspond to pair-wise similarities. We further design a graph construction strategy and a batch training algorithm with negative sampling to improve training efficiency. We instantiate SimTSC with ResNet as the backbone and Dynamic Time Warping (DTW) as the similarity measure. Extensive experiments on the full UCR datasets and several multivariate datasets demonstrate the effectiveness of incorporating similarity information into deep learning models in both supervised and semi-supervised settings. Our code is available at https://github.com/daochenzha/SimTSC
    AdaGDA: Faster Adaptive Gradient Descent Ascent Methods for Minimax Optimization. (arXiv:2106.16101v4 [math.OC] UPDATED)
    (2 min) In the paper, we propose a class of faster adaptive Gradient Descent Ascent (GDA) methods for solving the nonconvex-strongly-concave minimax problems based on unified adaptive matrices, which include almost existing coordinate-wise and global adaptive learning rates. Specifically, we propose a fast Adaptive Gradient Decent Ascent (AdaGDA) method based on the basic momentum technique, which reaches a lower gradient complexity of $O(\kappa^4\epsilon^{-4})$ for finding an $\epsilon$-stationary point without large batches, which improves the results of the existing adaptive GDA methods by a factor of $O(\sqrt{\kappa})$. At the same time, we present an accelerated version of AdaGDA (VR-AdaGDA) method based on the momentum-based variance reduced technique, which achieves a lower gradient complexity of $O(\kappa^{4.5}\epsilon^{-3})$ for finding an $\epsilon$-stationary point without large batches, which improves the results of the existing adaptive GDA methods by a factor of $O(\epsilon^{-1})$. Moreover, we prove that our VR-AdaGDA method reaches the best known gradient complexity of $O(\kappa^{3}\epsilon^{-3})$ with the mini-batch size $O(\kappa^3)$. In particular, we provide an effective convergence analysis framework for our adaptive GDA methods. Some experimental results on policy evaluation and fair classifier tasks demonstrate the efficiency of our algorithms.
    Mining Adverse Drug Reactions from Unstructured Mediums at Scale. (arXiv:2201.01405v1 [cs.CL])
    (2 min) Adverse drug reactions / events (ADR/ADE) have a major impact on patient health and health care costs. Detecting ADR's as early as possible and sharing them with regulators, pharma companies, and healthcare providers can prevent morbidity and save many lives. While most ADR's are not reported via formal channels, they are often documented in a variety of unstructured conversations such as social media posts by patients, customer support call transcripts, or CRM notes of meetings between healthcare providers and pharma sales reps. In this paper, we propose a natural language processing (NLP) solution that detects ADR's in such unstructured free-text conversations, which improves on previous work in three ways. First, a new Named Entity Recognition (NER) model obtains new state-of-the-art accuracy for ADR and Drug entity extraction on the ADE, CADEC, and SMM4H benchmark datasets (91.75%, 78.76%, and 83.41% F1 scores respectively). Second, two new Relation Extraction (RE) models are introduced - one based on BioBERT while the other utilizing crafted features over a Fully Connected Neural Network (FCNN) - are shown to perform on par with existing state-of-the-art models, and outperform them when trained with a supplementary clinician-annotated RE dataset. Third, a new text classification model, for deciding if a conversation includes an ADR, obtains new state-of-the-art accuracy on the CADEC dataset (86.69% F1 score). The complete solution is implemented as a unified NLP pipeline in a production-grade library built on top of Apache Spark, making it natively scalable and able to process millions of batch or streaming records on commodity clusters.
    Boosting Algorithms for Delivery Time Prediction in Transportation Logistics. (arXiv:2009.11598v2 [cs.LG] UPDATED)
    (0 min) Travel time is a crucial measure in transportation. Accurate travel time prediction is also fundamental for operation and advanced information systems. A variety of solutions exist for short-term travel time predictions such as solutions that utilize real-time GPS data and optimization methods to track the path of a vehicle. However, reliable long-term predictions remain challenging. We show in this paper the applicability and usefulness of travel time i.e. delivery time prediction for postal services. We investigate several methods such as linear regression models and tree based ensembles such as random forest, bagging, and boosting, that allow to predict delivery time by conducting extensive experiments and considering many usability scenarios. Results reveal that travel time prediction can help mitigate high delays in postal services. We show that some boosting algorithms, such as light gradient boosting and catboost, have a higher performance in terms of accuracy and runtime efficiency than other baselines such as linear regression models, bagging regressor and random forest.
    Using Machine Learning for Anomaly Detection on a System-on-Chip under Gamma Radiation. (arXiv:2201.01588v1 [cs.LG])
    (2 min) The emergence of new nanoscale technologies has imposed significant challenges to designing reliable electronic systems in radiation environments. A few types of radiation like Total Ionizing Dose (TID) effects often cause permanent damages on such nanoscale electronic devices, and current state-of-the-art technologies to tackle TID make use of expensive radiation-hardened devices. This paper focuses on a novel and different approach: using machine learning algorithms on consumer electronic level Field Programmable Gate Arrays (FPGAs) to tackle TID effects and monitor them to replace before they stop working. This condition has a research challenge to anticipate when the board results in a total failure due to TID effects. We observed internal measurements of the FPGA boards under gamma radiation and used three different anomaly detection machine learning (ML) algorithms to detect anomalies in the sensor measurements in a gamma-radiated environment. The statistical results show a highly significant relationship between the gamma radiation exposure levels and the board measurements. Moreover, our anomaly detection results have shown that a One-Class Support Vector Machine with Radial Basis Function Kernel has an average Recall score of 0.95. Also, all anomalies can be detected before the boards stop working.
    Regret Lower Bounds for Learning Linear Quadratic Gaussian Systems. (arXiv:2201.01680v1 [cs.LG])
    (2 min) This paper presents local minimax regret lower bounds for adaptively controlling linear-quadratic-Gaussian (LQG) systems. We consider smoothly parametrized instances and provide an understanding of when logarithmic regret is impossible which is both instance specific and flexible enough to take problem structure into account. This understanding relies on two key notions: That of local-uninformativeness; when the optimal policy does not provide sufficient excitation for identification of the optimal policy, and yields a degenerate Fisher information matrix; and that of information-regret-boundedness, when the small eigenvalues of a policy-dependent information matrix are boundable in terms of the regret of that policy. Combined with a reduction to Bayesian estimation and application of Van Trees' inequality, these two conditions are sufficient for proving regret bounds on order of magnitude $\sqrt{T}$ in the time horizon, $T$. This method yields lower bounds that exhibit tight dimensional dependencies and scale naturally with control-theoretic problem constants. For instance, we are able to prove that systems operating near marginal stability are fundamentally hard to learn to control. We further show that large classes of systems satisfy these conditions, among them any state-feedback system with both $A$- and $B$-matrices unknown. Most importantly, we also establish that a nontrivial class of partially observable systems, essentially those that are over-actuated, satisfy these conditions, thus providing a $\sqrt{T}$ lower bound also valid for partially observable systems. Finally, we turn to two simple examples which demonstrate that our lower bound captures classical control-theoretic intuition: our lower bounds diverge for systems operating near marginal stability or with large filter gain -- these can be arbitrarily hard to (learn to) control.
    Online certification of preference-based fairness for personalized recommender systems. (arXiv:2104.14527v4 [cs.LG] UPDATED)
    (0 min) Recommender systems are facing scrutiny because of their growing impact on the opportunities we have access to. Current audits for fairness are limited to coarse-grained parity assessments at the level of sensitive groups. We propose to audit for envy-freeness, a more granular criterion aligned with individual preferences: every user should prefer their recommendations to those of other users. Since auditing for envy requires to estimate the preferences of users beyond their existing recommendations, we cast the audit as a new pure exploration problem in multi-armed bandits. We propose a sample-efficient algorithm with theoretical guarantees that it does not deteriorate user experience. We also study the trade-offs achieved on real-world recommendation datasets.
    Deducing Optimal Classification Algorithm for Heterogeneous Fabric. (arXiv:2111.05558v3 [cs.LG] UPDATED)
    (0 min) For defining the optimal machine learning algorithm, the decision was not easy for which we shall choose. To help future researchers, we describe in this paper the optimal among the best of the algorithms. We built a synthetic data set and performed the supervised machine learning runs for five different algorithms. For heterogeneous rock fabric, we identified Random Forest, among others, to be the appropriate algorithm.
    Tracking Most Severe Arm Changes in Bandits. (arXiv:2112.13838v2 [cs.LG] UPDATED)
    (0 min) In bandits with distribution shifts, one aims to automatically detect an unknown number $L$ of changes in reward distribution, and restart exploration when necessary. While this problem remained open for many years, a recent breakthrough of Auer et al. (2018, 2019) provide the first adaptive procedure to guarantee an optimal (dynamic) regret $\sqrt{LT}$, for $T$ rounds, with no knowledge of $L$. However, not all distributional shifts are equally severe, e.g., suppose no best arm switches occur, then we cannot rule out that a regret $O(\sqrt{T})$ may remain possible; in other words, is it possible to achieve dynamic regret that optimally scales only with an unknown number of severe shifts? This unfortunately has remained elusive, despite various attempts (Auer et al., 2019, Foster et al., 2020). We resolve this problem in the case of two-armed bandits: we derive an adaptive procedure that guarantees a dynamic regret of order $\tilde{O}(\sqrt{\tilde{L} T})$, where $\tilde L \ll L$ captures an unknown number of severe best arm changes, i.e., with significant switches in rewards, and which last sufficiently long to actually require a restart. As a consequence, for any number $L$ of distributional shifts outside of these severe shifts, our procedure achieves regret just $\tilde{O}(\sqrt{T})\ll \tilde{O}(\sqrt{LT})$. Finally, we note that our notion of severe shift applies in both classical settings of stochastic switching bandits and of adversarial bandits.
    Physics-informed neural network method for modelling beam-wall interactions. (arXiv:2112.11323v2 [physics.acc-ph] UPDATED)
    (2 min) A mesh-free approach for modelling beam-wall interactions in particle accelerators is proposed. The key idea of our method is to use a deep neural network as a surrogate for the solution to a set of partial differential equations involving the particle beam, and the surface impedance concept. The proposed approach is applied to the coupling impedance of an accelerator vacuum chamber with thin conductive coating, and also verified in comparison with the existing analytical formula.
    TUNet: A Block-online Bandwidth Extension Model based on Transformers and Self-supervised Pretraining. (arXiv:2110.13492v2 [cs.LG] UPDATED)
    (2 min) We introduce a block-online variant of the temporal feature-wise linear modulation (TFiLM) model to achieve bandwidth extension. The proposed architecture simplifies the UNet backbone of the TFiLM to reduce inference time and employs an efficient transformer at the bottleneck to alleviate performance degradation. We also utilize self-supervised pretraining and data augmentation to enhance the quality of bandwidth extended signals and reduce the sensitivity with respect to downsampling methods. Experiment results on the VCTK dataset show that the proposed method outperforms several recent baselines in both intrusive and non-intrusive metrics. Pretraining and filter augmentation also help stabilize and enhance the overall performance.
    Understanding Entropy Coding With Asymmetric Numeral Systems (ANS): a Statistician's Perspective. (arXiv:2201.01741v1 [stat.ML])
    (2 min) Entropy coding is the backbone data compression. Novel machine-learning based compression methods often use a new entropy coder called Asymmetric Numeral Systems (ANS) [Duda et al., 2015], which provides very close to optimal bitrates and simplifies [Townsend et al., 2019] advanced compression techniques such as bits-back coding. However, researchers with a background in machine learning often struggle to understand how ANS works, which prevents them from exploiting its full versatility. This paper is meant as an educational resource to make ANS more approachable by presenting it from a new perspective of latent variable models and the so-called bits-back trick. We guide the reader step by step to a complete implementation of ANS in the Python programming language, which we then generalize for more advanced use cases. We also present and empirically evaluate an open-source library of various entropy coders designed for both research and production use. Related teaching videos and problem sets are available online.
    Bringing Your Own View: Graph Contrastive Learning without Prefabricated Data Augmentations. (arXiv:2201.01702v1 [cs.LG])
    (0 min) Self-supervision is recently surging at its new frontier of graph learning. It facilitates graph representations beneficial to downstream tasks; but its success could hinge on domain knowledge for handcraft or the often expensive trials and errors. Even its state-of-the-art representative, graph contrastive learning (GraphCL), is not completely free of those needs as GraphCL uses a prefabricated prior reflected by the ad-hoc manual selection of graph data augmentations. Our work aims at advancing GraphCL by answering the following questions: How to represent the space of graph augmented views? What principle can be relied upon to learn a prior in that space? And what framework can be constructed to learn the prior in tandem with contrastive learning? Accordingly, we have extended the prefabricated discrete prior in the augmentation set, to a learnable continuous prior in the parameter space of graph generators, assuming that graph priors per se, similar to the concept of image manifolds, can be learned by data generation. Furthermore, to form contrastive views without collapsing to trivial solutions due to the prior learnability, we have leveraged both principles of information minimization (InfoMin) and information bottleneck (InfoBN) to regularize the learned priors. Eventually, contrastive learning, InfoMin, and InfoBN are incorporated organically into one framework of bi-level optimization. Our principled and automated approach has proven to be competitive against the state-of-the-art graph self-supervision methods, including GraphCL, on benchmarks of small graphs; and shown even better generalizability on large-scale graphs, without resorting to human expertise or downstream validation. Our code is publicly released at https://github.com/Shen-Lab/GraphCL_Automated.
    Sign Language Recognition System using TensorFlow Object Detection API. (arXiv:2201.01486v1 [cs.CV])
    (0 min) Communication is defined as the act of sharing or exchanging information, ideas or feelings. To establish communication between two people, both of them are required to have knowledge and understanding of a common language. But in the case of deaf and dumb people, the means of communication are different. Deaf is the inability to hear and dumb is the inability to speak. They communicate using sign language among themselves and with normal people but normal people do not take seriously the importance of sign language. Not everyone possesses the knowledge and understanding of sign language which makes communication difficult between a normal person and a deaf and dumb person. To overcome this barrier, one can build a model based on machine learning. A model can be trained to recognize different gestures of sign language and translate them into English. This will help a lot of people in communicating and conversing with deaf and dumb people. The existing Indian Sing Language Recognition systems are designed using machine learning algorithms with single and double-handed gestures but they are not real-time. In this paper, we propose a method to create an Indian Sign Language dataset using a webcam and then using transfer learning, train a TensorFlow model to create a real-time Sign Language Recognition system. The system achieves a good level of accuracy even with a limited size dataset.
    Meta Two-Sample Testing: Learning Kernels for Testing with Limited Data. (arXiv:2106.07636v2 [stat.ML] UPDATED)
    (0 min) Modern kernel-based two-sample tests have shown great success in distinguishing complex, high-dimensional distributions with appropriate learned kernels. Previous work has demonstrated that this kernel learning procedure succeeds, assuming a considerable number of observed samples from each distribution. In realistic scenarios with very limited numbers of data samples, however, it can be challenging to identify a kernel powerful enough to distinguish complex distributions. We address this issue by introducing the problem of meta two-sample testing (M2ST), which aims to exploit (abundant) auxiliary data on related tasks to find an algorithm that can quickly identify a powerful test on new target tasks. We propose two specific algorithms for this task: a generic scheme which improves over baselines and a more tailored approach which performs even better. We provide both theoretical justification and empirical evidence that our proposed meta-testing schemes out-perform learning kernel-based tests directly from scarce observations, and identify when such schemes will be successful.
    Online Adaptation to Label Distribution Shift. (arXiv:2107.04520v2 [cs.LG] UPDATED)
    (0 min) Machine learning models often encounter distribution shifts when deployed in the real world. In this paper, we focus on adaptation to label distribution shift in the online setting, where the test-time label distribution is continually changing and the model must dynamically adapt to it without observing the true label. Leveraging a novel analysis, we show that the lack of true label does not hinder estimation of the expected test loss, which enables the reduction of online label shift adaptation to conventional online learning. Informed by this observation, we propose adaptation algorithms inspired by classical online learning techniques such as Follow The Leader (FTL) and Online Gradient Descent (OGD) and derive their regret bounds. We empirically verify our findings under both simulated and real world label distribution shifts and show that OGD is particularly effective and robust to a variety of challenging label shift scenarios.
    Augmenting astrophysical scaling relations with machine learning : application to reducing the SZ flux-mass scatter. (arXiv:2201.01305v1 [astro-ph.CO])
    (2 min) Complex systems (stars, supernovae, galaxies, and clusters) often exhibit low scatter relations between observable properties (e.g., luminosity, velocity dispersion, oscillation period, temperature). These scaling relations can illuminate the underlying physics and can provide observational tools for estimating masses and distances. Machine learning can provide a systematic way to search for new scaling relations (or for simple extensions to existing relations) in abstract high-dimensional parameter spaces. We use a machine learning tool called symbolic regression (SR), which models the patterns in a given dataset in the form of analytic equations. We focus on the Sunyaev-Zeldovich flux$-$cluster mass relation ($Y_\mathrm{SZ}-M$), the scatter in which affects inference of cosmological parameters from cluster abundance data. Using SR on the data from the IllustrisTNG hydrodynamical simulation, we find a new proxy for cluster mass which combines $Y_\mathrm{SZ}$ and concentration of ionized gas ($c_\mathrm{gas}$): $M \propto Y_\mathrm{conc}^{3/5} \equiv Y_\mathrm{SZ}^{3/5} (1-A\, c_\mathrm{gas})$. $Y_\mathrm{conc}$ reduces the scatter in the predicted $M$ by $\sim 20-30$% for large clusters ($M\gtrsim 10^{14}\, h^{-1} \, M_\odot$) at both high and low redshifts, as compared to using just $Y_\mathrm{SZ}$. We show that the dependence on $c_\mathrm{gas}$ is linked to cores of clusters exhibiting larger scatter than their outskirts. Finally, we test $Y_\mathrm{conc}$ on clusters from simulations of the CAMELS project and show that $Y_\mathrm{conc}$ is robust against variations in cosmology, astrophysics, subgrid physics, and cosmic variance. Our results and methodology can be useful for accurate multiwavelength cluster mass estimation from current and upcoming CMB and X-ray surveys like ACT, SO, SPT, eROSITA and CMB-S4.
    Corrupting Data to Remove Deceptive Perturbation: Using Preprocessing Method to Improve System Robustness. (arXiv:2201.01399v1 [cs.CV])
    (2 min) Although deep neural networks have achieved great performance on classification tasks, recent studies showed that well trained networks can be fooled by adding subtle noises. This paper introduces a new approach to improve neural network robustness by applying the recovery process on top of the naturally trained classifier. In this approach, images will be intentionally corrupted by some significant operator and then be recovered before passing through the classifiers. SARGAN -- an extension on Generative Adversarial Networks (GAN) is capable of denoising radar signals. This paper will show that SARGAN can also recover corrupted images by removing the adversarial effects. Our results show that this approach does improve the performance of naturally trained networks.
    Convergence and Complexity of Stochastic Block Majorization-Minimization. (arXiv:2201.01652v1 [math.OC])
    (2 min) Stochastic majorization-minimization (SMM) is an online extension of the classical principle of majorization-minimization, which consists of sampling i.i.d. data points from a fixed data distribution and minimizing a recursively defined majorizing surrogate of an objective function. In this paper, we introduce stochastic block majorization-minimization, where the surrogates can now be only block multi-convex and a single block is optimized at a time within a diminishing radius. Relaxing the standard strong convexity requirements for surrogates in SMM, our framework gives wider applicability including online CANDECOMP/PARAFAC (CP) dictionary learning and yields greater computational efficiency especially when the problem dimension is large. We provide an extensive convergence analysis on the proposed algorithm, which we derive under possibly dependent data streams, relaxing the standard i.i.d. assumption on data samples. We show that the proposed algorithm converges almost surely to the set of stationary points of a nonconvex objective under constraints at a rate $O((\log n)^{1+\eps}/n^{1/2})$ for the empirical loss function and $O((\log n)^{1+\eps}/n^{1/4})$ for the expected loss function, where $n$ denotes the number of data samples processed. Under some additional assumption, the latter convergence rate can be improved to $O((\log n)^{1+\eps}/n^{1/2})$. Our results provide first convergence rate bounds for various online matrix and tensor decomposition algorithms under a general Markovian data setting.
    Synthesizing Tensor Transformations for Visual Self-attention. (arXiv:2201.01410v1 [cs.CV])
    (2 min) Self-attention shows outstanding competence in capturing long-range relationships while enhancing performance on vision tasks, such as image classification and image captioning. However, the self-attention module highly relies on the dot product multiplication and dimension alignment among query-key-value features, which cause two problems: (1) The dot product multiplication results in exhaustive and redundant computation. (2) Due to the visual feature map often appearing as a multi-dimensional tensor, reshaping the scale of the tensor feature to adapt to the dimension alignment might destroy the internal structure of the tensor feature map. To address these problems, this paper proposes a self-attention plug-in module with its variants, namely, Synthesizing Tensor Transformations (STT), for directly processing image tensor features. Without computing the dot-product multiplication among query-key-value, the basic STT is composed of the tensor transformation to learn the synthetic attention weight from visual information. The effectiveness of STT series is validated on the image classification and image caption. Experiments show that the proposed STT achieves competitive performance while keeping robustness compared to self-attention based above vision tasks.
    Adaptive Online Incremental Learning for Evolving Data Streams. (arXiv:2201.01633v1 [cs.LG])
    (2 min) Recent years have witnessed growing interests in online incremental learning. However, there are three major challenges in this area. The first major difficulty is concept drift, that is, the probability distribution in the streaming data would change as the data arrives. The second major difficulty is catastrophic forgetting, that is, forgetting what we have learned before when learning new knowledge. The last one we often ignore is the learning of the latent representation. Only good latent representation can improve the prediction accuracy of the model. Our research builds on this observation and attempts to overcome these difficulties. To this end, we propose an Adaptive Online Incremental Learning for evolving data streams (AOIL). We use auto-encoder with the memory module, on the one hand, we obtained the latent features of the input, on the other hand, according to the reconstruction loss of the auto-encoder with memory module, we could successfully detect the existence of concept drift and trigger the update mechanism, adjust the model parameters in time. In addition, we divide features, which are derived from the activation of the hidden layers, into two parts, which are used to extract the common and private features respectively. By means of this approach, the model could learn the private features of the new coming instances, but do not forget what we have learned in the past (shared features), which reduces the occurrence of catastrophic forgetting. At the same time, to get the fusion feature vector we use the self-attention mechanism to effectively fuse the extracted features, which further improved the latent representation learning.
    Sample Selection with Deadline Control for Efficient Federated Learning on Heterogeneous Clients. (arXiv:2201.01601v1 [cs.LG])
    (2 min) Federated Learning (FL) trains a machine learning model on distributed clients without exposing individual data. Unlike centralized training that is usually based on carefully-organized data, FL deals with on-device data that are often unfiltered and imbalanced. As a result, conventional FL training protocol that treats all data equally leads to a waste of local computational resources and slows down the global learning process. To this end, we propose FedBalancer, a systematic FL framework that actively selects clients' training samples. Our sample selection strategy prioritizes more "informative" data while respecting privacy and computational capabilities of clients. To better utilize the sample selection to speed up global training, we further introduce an adaptive deadline control scheme that predicts the optimal deadline for each round with varying client train data. Compared with existing FL algorithms with deadline configuration methods, our evaluation on five datasets from three different domains shows that FedBalancer improves the time-to-accuracy performance by 1.22~4.62x while improving the model accuracy by 1.0~3.3%. We also show that FedBalancer is readily applicable to other FL approaches by demonstrating that FedBalancer improves the convergence speed and accuracy when operating jointly with three different FL algorithms.
    Joint Learning-Based Stabilization of Multiple Unknown Linear Systems. (arXiv:2201.01387v1 [eess.SY])
    (2 min) Learning-based control of linear systems received a lot of attentions recently. In popular settings, the true dynamical models are unknown to the decision-maker and need to be interactively learned by applying control inputs to the systems. Unlike the matured literature of efficient reinforcement learning policies for adaptive control of a single system, results on joint learning of multiple systems are not currently available. Especially, the important problem of fast and reliable joint-stabilization remains unaddressed and so is the focus of this work. We propose a novel joint learning-based stabilization algorithm for quickly learning stabilizing policies for all systems understudy, from the data of unstable state trajectories. The presented procedure is shown to be notably effective such that it stabilizes the family of dynamical systems in an extremely short time period.
    Offsetting Unequal Competition through RL-assisted Incentive Schemes. (arXiv:2201.01450v1 [cs.GT])
    (2 min) This paper investigates the dynamics of competition among organizations with unequal expertise. Multi-agent reinforcement learning has been used to simulate and understand the impact of various incentive schemes designed to offset such inequality. We design Touch-Mark, a game based on well-known multi-agent-particle-environment, where two teams (weak, strong) with unequal but changing skill levels compete against each other. For training such a game, we propose a novel controller assisted multi-agent reinforcement learning algorithm \our\, which empowers each agent with an ensemble of policies along with a supervised controller that by selectively partitioning the sample space, triggers intelligent role division among the teammates. Using C-MADDPG as an underlying framework, we propose an incentive scheme for the weak team such that the final rewards of both teams become the same. We find that in spite of the incentive, the final reward of the weak team falls short of the strong team. On inspecting, we realize that an overall incentive scheme for the weak team does not incentivize the weaker agents within that team to learn and improve. To offset this, we now specially incentivize the weaker player to learn and as a result, observe that the weak team beyond an initial phase performs at par with the stronger team. The final goal of the paper has been to formulate a dynamic incentive scheme that continuously balances the reward of the two teams. This is achieved by devising an incentive scheme enriched with an RL agent which takes minimum information from the environment.
    ZeroBERTo -- Leveraging Zero-Shot Text Classification by Topic Modeling. (arXiv:2201.01337v1 [cs.CL])
    (2 min) Traditional text classification approaches often require a good amount of labeled data, which is difficult to obtain, especially in restricted domains or less widespread languages. This lack of labeled data has led to the rise of low-resource methods, that assume low data availability in natural language processing. Among them, zero-shot learning stands out, which consists of learning a classifier without any previously labeled data. The best results reported with this approach use language models such as Transformers, but fall into two problems: high execution time and inability to handle long texts as input. This paper proposes a new model, ZeroBERTo, which leverages an unsupervised clustering step to obtain a compressed data representation before the classification task. We show that ZeroBERTo has better performance for long inputs and shorter execution time, outperforming XLM-R by about 12% in the F1 score in the FolhaUOL dataset. Keywords: Low-Resource NLP, Unlabeled data, Zero-Shot Learning, Topic Modeling, Transformers.
    Linear Variational State Space Filtering. (arXiv:2201.01353v1 [cs.LG])
    (2 min) We introduce Variational State-Space Filters (VSSF), a new method for unsupervised learning, identification, and filtering of latent Markov state space models from raw pixels. We present a theoretically sound framework for latent state space inference under heterogeneous sensor configurations. The resulting model can integrate an arbitrary subset of the sensor measurements used during training, enabling the learning of semi-supervised state representations, thus enforcing that certain components of the learned latent state space to agree with interpretable measurements. From this framework we derive L-VSSF, an explicit instantiation of this model with linear latent dynamics and Gaussian distribution parameterizations. We experimentally demonstrate L-VSSF's ability to filter in latent space beyond the sequence length of the training dataset across several different test environments.
    Learning Differentiable Safety-Critical Control using Control Barrier Functions for Generalization to Novel Environments. (arXiv:2201.01347v1 [eess.SY])
    (2 min) Control barrier functions (CBFs) have become a popular tool to enforce safety of a control system. CBFs are commonly utilized in a quadratic program formulation (CBF-QP) as safety-critical constraints. A class $\mathcal{K}$ function in CBFs usually needs to be tuned manually in order to balance the trade-off between performance and safety for each environment. However, this process is often heuristic and can become intractable for high relative-degree systems. Moreover, it prevents the CBF-QP from generalizing to different environments in the real world. By embedding the optimization procedure of the CBF-QP as a differentiable layer within a deep learning architecture, we propose a differentiable optimization-based safety-critical control framework that enables generalization to new environments with forward invariance guarantees. Finally, we validate the proposed control design with 2D double and quadruple integrator systems in various environments.
    The CAMELS project: public data release. (arXiv:2201.01300v1 [astro-ph.CO])
    (2 min) The Cosmology and Astrophysics with MachinE Learning Simulations (CAMELS) project was developed to combine cosmology with astrophysics through thousands of cosmological hydrodynamic simulations and machine learning. CAMELS contains 4,233 cosmological simulations, 2,049 N-body and 2,184 state-of-the-art hydrodynamic simulations that sample a vast volume in parameter space. In this paper we present the CAMELS public data release, describing the characteristics of the CAMELS simulations and a variety of data products generated from them, including halo, subhalo, galaxy, and void catalogues, power spectra, bispectra, Lyman-$\alpha$ spectra, probability distribution functions, halo radial profiles, and X-rays photon lists. We also release over one thousand catalogues that contain billions of galaxies from CAMELS-SAM: a large collection of N-body simulations that have been combined with the Santa Cruz Semi-Analytic Model. We release all the data, comprising more than 350 terabytes and containing 143,922 snapshots, millions of halos, galaxies and summary statistics. We provide further technical details on how to access, download, read, and process the data at \url{https://camels.readthedocs.io}.
    Semantic Communications: Principles and Challenges. (arXiv:2201.01389v1 [cs.IT])
    (2 min) Semantic communication, regarded as the breakthrough beyond Shannon paradigm, aims at the successful transmission of semantic information conveyed by the source rather than the accurate reception of each single symbol or bit regardless of its meaning. This article provides an overview on semantic communications. After a brief review on Shannon information theory, we discuss semantic communications with theory, frameworks, and system design enabled by deep learning. Different from the symbol/bit error rate used for measuring the conventional communication systems, new performance metrics for semantic communications are also discussed. The article is concluded by several open questions.
    Graph Decipher: A transparent dual-attention graph neural network to understand the message-passing mechanism for the node classification. (arXiv:2201.01381v1 [cs.LG])
    (2 min) Graph neural networks can be effectively applied to find solutions for many real-world problems across widely diverse fields. The success of graph neural networks is linked to the message-passing mechanism on the graph, however, the message-aggregating behavior is still not entirely clear in most algorithms. To improve functionality, we propose a new transparent network called Graph Decipher to investigate the message-passing mechanism by prioritizing in two main components: the graph structure and node attributes, at the graph, feature, and global levels on a graph under the node classification task. However, the computation burden now becomes the most significant issue because the relevance of both graph structure and node attributes are computed on a graph. In order to solve this issue, only relevant representative node attributes are extracted by graph feature filters, allowing calculations to be performed in a category-oriented manner. Experiments on seven datasets show that Graph Decipher achieves state-of-the-art performance while imposing a substantially lower computation burden under the node classification task. Additionally, since our algorithm has the ability to explore the representative node attributes by category, it is utilized to alleviate the imbalanced node classification problem on multi-class graph datasets.
    Test and Evaluation of Quadrupedal Walking Gaits through Sim2Real Gap Quantification. (arXiv:2201.01323v1 [eess.SY])
    (2 min) In this letter, the authors propose a two-step approach to evaluate and verify a true system's capacity to satisfy its operational objective. Specifically, whenever the system objective has a quantifiable measure of satisfaction, i.e. a signal temporal logic specification, a barrier function, etc - the authors develop two separate optimization problems solvable via a Bayesian Optimization procedure detailed within. This dual approach has the added benefit of quantifying the Sim2Real Gap between a system simulator and its hardware counterpart. Our contributions are twofold. First, we show repeatability with respect to our outlined optimization procedure in solving these optimization problems. Second, we show that the same procedure can discriminate between different environments by identifying the Sim2Real Gap between a simulator and its hardware counterpart operating in different environments.
    Towards Understanding Quality Challenges of the Federated Learning: A First Look from the Lens of Robustness. (arXiv:2201.01409v1 [cs.LG])
    (2 min) Federated learning (FL) is a widely adopted distributed learning paradigm in practice, which intends to preserve users' data privacy while leveraging the entire dataset of all participants for training. In FL, multiple models are trained independently on the users and aggregated centrally to update a global model in an iterative process. Although this approach is excellent at preserving privacy by design, FL still tends to suffer from quality issues such as attacks or byzantine faults. Some recent attempts have been made to address such quality challenges on the robust aggregation techniques for FL. However, the effectiveness of state-of-the-art (SOTA) robust FL techniques is still unclear and lacks a comprehensive study. Therefore, to better understand the current quality status and challenges of these SOTA FL techniques in the presence of attacks and faults, in this paper, we perform a large-scale empirical study to investigate the SOTA FL's quality from multiple angles of attacks, simulated faults (via mutation operators), and aggregation (defense) methods. In particular, we perform our study on two generic image datasets and one real-world federated medical image dataset. We also systematically investigate the effect of the distribution of attacks/faults over users and the independent and identically distributed (IID) factors, per dataset, on the robustness results. After a large-scale analysis with 496 configurations, we find that most mutators on each individual user have a negligible effect on the final model. Moreover, choosing the most robust FL aggregator depends on the attacks and datasets. Finally, we illustrate that it is possible to achieve a generic solution that works almost as well or even better than any single aggregator on all attacks and configurations with a simple ensemble model of aggregators.
    Latent Vector Expansion using Autoencoder for Anomaly Detection. (arXiv:2201.01416v1 [cs.CV])
    (2 min) Deep learning methods can classify various unstructured data such as images, language, and voice as input data. As the task of classifying anomalies becomes more important in the real world, various methods exist for classifying using deep learning with data collected in the real world. As the task of classifying anomalies becomes more important in the real world, there are various methods for classifying using deep learning with data collected in the real world. Among the various methods, the representative approach is a method of extracting and learning the main features based on a transition model from pre-trained models, and a method of learning an autoencoderbased structure only with normal data and classifying it as abnormal through a threshold value. However, if the dataset is imbalanced, even the state-of-the-arts models do not achieve good performance. This can be addressed by augmenting normal and abnormal features in imbalanced data as features with strong distinction. We use the features of the autoencoder to train latent vectors from low to high dimensionality. We train normal and abnormal data as a feature that has a strong distinction among the features of imbalanced data. We propose a latent vector expansion autoencoder model that improves classification performance at imbalanced data. The proposed method shows performance improvement compared to the basic autoencoder using imbalanced anomaly dataset.
    Deep learning for location based beamforming with NLOS channels. (arXiv:2201.01386v1 [cs.IT])
    (2 min) Massive MIMO systems are highly efficient but critically rely on accurate channel state information (CSI) at the base station in order to determine appropriate precoders. CSI acquisition requires sending pilot symbols which induce an important overhead. In this paper, a method whose objective is to determine an appropriate precoder from the knowledge of the user's location only is proposed. Such a way to determine precoders is known as location based beamforming. It allows to reduce or even eliminate the need for pilot symbols, depending on how the location is obtained. the proposed method learns a direct mapping from location to precoder in a supervised way. It involves a neural network with a specific structure based on random Fourier features allowing to learn functions containing high spatial frequencies. It is assessed empirically and yields promising results on realistic synthetic channels. As opposed to previously proposed methods, it allows to handle both line-of-sight (LOS) and non-line-of-sight (NLOS) channels.
    Balsa: Learning a Query Optimizer Without Expert Demonstrations. (arXiv:2201.01441v1 [cs.DB])
    (2 min) Query optimizers are a performance-critical component in every database system. Due to their complexity, optimizers take experts months to write and years to refine. In this work, we demonstrate for the first time that learning to optimize queries without learning from an expert optimizer is both possible and efficient. We present Balsa, a query optimizer built by deep reinforcement learning. Balsa first learns basic knowledge from a simple, environment-agnostic simulator, followed by safe learning in real execution. On the Join Order Benchmark, Balsa matches the performance of two expert query optimizers, both open-source and commercial, with two hours of learning, and outperforms them by up to 2.8$\times$ in workload runtime after a few more hours. Balsa thus opens the possibility of automatically learning to optimize in future compute environments where expert-designed optimizers do not exist.
    Problem-dependent attention and effort in neural networks with an application to image resolution. (arXiv:2201.01415v1 [cs.CV])
    (2 min) This paper introduces a new neural network-based estimation approach that is inspired by the biological phenomenon whereby humans and animals vary the levels of attention and effort that they dedicate to a problem depending upon its difficulty. The proposed approach leverages alternate models' internal levels of confidence in their own projections. If the least costly model is confident in its classification, then that is the classification used; if not, the model with the next lowest cost of implementation is run, and so on. This use of successively more complex models -- together with the models' internal propensity scores to evaluate their likelihood of being correct -- makes it possible to substantially reduce resource use while maintaining high standards for classification accuracy. The approach is applied to the digit recognition problem from Google's Street View House Numbers dataset, using Multilayer Perceptron (MLP) neural networks trained on high- and low-resolution versions of the digit images. The algorithm examines the low-resolution images first, only moving to higher resolution images if the classification from the initial low-resolution pass does not have a high degree of confidence. For the MLPs considered here, this sequential approach enables a reduction in resource usage of more than 50\% without any sacrifice in classification accuracy.
    Bridging Adversarial and Nonstationary Multi-armed Bandit. (arXiv:2201.01628v1 [cs.LG])
    (2 min) In the multi-armed bandit framework, there are two formulations that are commonly employed to handle time-varying reward distributions: adversarial bandit and nonstationary bandit. Although their oracles, algorithms, and regret analysis differ significantly, we provide a unified formulation in this paper that smoothly bridges the two as special cases. The formulation uses an oracle that takes the best-fixed arm within time windows. Depending on the window size, it turns into the oracle in hindsight in the adversarial bandit and dynamic oracle in the nonstationary bandit. We provide algorithms that attain the optimal regret with the matching lower bound.
    Efficient-Dyn: Dynamic Graph Representation Learning via Event-based Temporal Sparse Attention Network. (arXiv:2201.01384v1 [cs.LG])
    (2 min) Static graph neural networks have been widely used in modeling and representation learning of graph structure data. However, many real-world problems, such as social networks, financial transactions, recommendation systems, etc., are dynamic, that is, nodes and edges are added or deleted over time. Therefore, in recent years, dynamic graph neural networks have received more and more attention from researchers. In this work, we propose a novel dynamic graph neural network, Efficient-Dyn. It adaptively encodes temporal information into a sequence of patches with an equal amount of temporal-topological structure. Therefore, while avoiding the use of snapshots to cause information loss, it also achieves a finer time granularity, which is close to what continuous networks could provide. In addition, we also designed a lightweight module, Sparse Temporal Transformer, to compute node representations through both structural neighborhoods and temporal dynamics. Since the fully-connected attention conjunction is simplified, the computation cost is far lower than the current state-of-the-arts. Link prediction experiments are conducted on both continuous and discrete graph datasets. Through comparing with several state-of-the-art graph embedding baselines, the experimental results demonstrate that Efficient-Dyn has a faster inference speed while having competitive performance.
    Self-Supervised Approach to Addressing Zero-Shot Learning Problem. (arXiv:2201.01391v1 [cs.CV])
    (2 min) In recent years, self-supervised learning has had significant success in applications involving computer vision and natural language processing. The type of pretext task is important to this boost in performance. One common pretext task is the measure of similarity and dissimilarity between pairs of images. In this scenario, the two images that make up the negative pair are visibly different to humans. However, in entomology, species are nearly indistinguishable and thus hard to differentiate. In this study, we explored the performance of a Siamese neural network using contrastive loss by learning to push apart embeddings of bumblebee species pair that are dissimilar, and pull together similar embeddings. Our experimental results show a 61% F1-score on zero-shot instances, a performance showing 11% improvement on samples of classes that share intersections with the training set.
  • stat.ML updates on arXiv.org

    Tracking Most Severe Arm Changes in Bandits. (arXiv:2112.13838v2 [cs.LG] UPDATED)
    (2 min) In bandits with distribution shifts, one aims to automatically detect an unknown number $L$ of changes in reward distribution, and restart exploration when necessary. While this problem remained open for many years, a recent breakthrough of Auer et al. (2018, 2019) provide the first adaptive procedure to guarantee an optimal (dynamic) regret $\sqrt{LT}$, for $T$ rounds, with no knowledge of $L$. However, not all distributional shifts are equally severe, e.g., suppose no best arm switches occur, then we cannot rule out that a regret $O(\sqrt{T})$ may remain possible; in other words, is it possible to achieve dynamic regret that optimally scales only with an unknown number of severe shifts? This unfortunately has remained elusive, despite various attempts (Auer et al., 2019, Foster et al., 2020). We resolve this problem in the case of two-armed bandits: we derive an adaptive procedure that guarantees a dynamic regret of order $\tilde{O}(\sqrt{\tilde{L} T})$, where $\tilde L \ll L$ captures an unknown number of severe best arm changes, i.e., with significant switches in rewards, and which last sufficiently long to actually require a restart. As a consequence, for any number $L$ of distributional shifts outside of these severe shifts, our procedure achieves regret just $\tilde{O}(\sqrt{T})\ll \tilde{O}(\sqrt{LT})$. Finally, we note that our notion of severe shift applies in both classical settings of stochastic switching bandits and of adversarial bandits.
    Deep Reinforcement Learning at the Edge of the Statistical Precipice. (arXiv:2108.13264v4 [cs.LG] UPDATED)
    (3 min) Deep reinforcement learning (RL) algorithms are predominantly evaluated by comparing their relative performance on a large suite of tasks. Most published results on deep RL benchmarks compare point estimates of aggregate performance such as mean and median scores across tasks, ignoring the statistical uncertainty implied by the use of a finite number of training runs. Beginning with the Arcade Learning Environment (ALE), the shift towards computationally-demanding benchmarks has led to the practice of evaluating only a small number of runs per task, exacerbating the statistical uncertainty in point estimates. In this paper, we argue that reliable evaluation in the few run deep RL regime cannot ignore the uncertainty in results without running the risk of slowing down progress in the field. We illustrate this point using a case study on the Atari 100k benchmark, where we find substantial discrepancies between conclusions drawn from point estimates alone versus a more thorough statistical analysis. With the aim of increasing the field's confidence in reported results with a handful of runs, we advocate for reporting interval estimates of aggregate performance and propose performance profiles to account for the variability in results, as well as present more robust and efficient aggregate metrics, such as interquartile mean scores, to achieve small uncertainty in results. Using such statistical tools, we scrutinize performance evaluations of existing algorithms on other widely used RL benchmarks including the ALE, Procgen, and the DeepMind Control Suite, again revealing discrepancies in prior comparisons. Our findings call for a change in how we evaluate performance in deep RL, for which we present a more rigorous evaluation methodology, accompanied with an open-source library rliable, to prevent unreliable results from stagnating the field.
    Systematic assessment of the quality of fit of the stochastic block model for empirical networks. (arXiv:2201.01658v1 [physics.soc-ph])
    (2 min) We perform a systematic analysis of the quality of fit of the stochastic block model (SBM) for 275 empirical networks spanning a wide range of domains and orders of size magnitude. We employ posterior predictive model checking as a criterion to assess the quality of fit, which involves comparing networks generated by the inferred model with the empirical network, according to a set of network descriptors. We observe that the SBM is capable of providing an accurate description for the majority of networks considered, but falls short of saturating all modeling requirements. In particular, networks possessing a large diameter and slow-mixing random walks tend to be badly described by the SBM. However, contrary to what is often assumed, networks with a high abundance of triangles can be well described by the SBM in many cases. We demonstrate that simple network descriptors can be used to evaluate whether or not the SBM can provide a sufficiently accurate representation, potentially pointing to possible model extensions that can systematically improve the expressiveness of this class of models.
    New Hard-thresholding Rules based on Data Splitting in High-dimensional Imbalanced Classification. (arXiv:2111.03306v2 [stat.ME] UPDATED)
    (2 min) In binary classification, imbalance refers to situations in which one class is heavily under-represented. This issue is due to either a data collection process or because one class is indeed rare in a population. Imbalanced classification frequently arises in applications such as biology, medicine, engineering, and social sciences. In this paper, for the first time, we theoretically study the impact of imbalance class sizes on the linear discriminant analysis (LDA) in high dimensions. We show that due to data scarcity in one class, referred to as the minority class, and high-dimensionality of the feature space, the LDA ignores the minority class yielding a maximum misclassification rate. We then propose a new construction of hard-thresholding rules based on a data splitting technique that reduces the large difference between the misclassification rates. We show that the proposed method is asymptotically optimal. We further study two well-known sparse versions of the LDA in imbalanced cases. We evaluate the finite-sample performance of different methods using simulations and by analyzing two real data sets. The results show that our method either outperforms its competitors or has comparable performance based on a much smaller subset of selected features, while being computationally more efficient.
    FeatureEnVi: Visual Analytics for Feature Engineering Using Stepwise Selection and Semi-Automatic Extraction Approaches. (arXiv:2103.14539v3 [cs.LG] UPDATED)
    (2 min) The machine learning (ML) life cycle involves a series of iterative steps, from the effective gathering and preparation of the data, including complex feature engineering processes, to the presentation and improvement of results, with various algorithms to choose from in every step. Feature engineering in particular can be very beneficial for ML, leading to numerous improvements such as boosting the predictive results, decreasing computational times, reducing excessive noise, and increasing the transparency behind the decisions taken during the training. Despite that, while several visual analytics tools exist to monitor and control the different stages of the ML life cycle (especially those related to data and algorithms), feature engineering support remains inadequate. In this paper, we present FeatureEnVi, a visual analytics system specifically designed to assist with the feature engineering process. Our proposed system helps users to choose the most important feature, to transform the original features into powerful alternatives, and to experiment with different feature generation combinations. Additionally, data space slicing allows users to explore the impact of features on both local and global scales. FeatureEnVi utilizes multiple automatic feature selection techniques; furthermore, it visually guides users with statistical evidence about the influence of each feature (or subsets of features). The final outcome is the extraction of heavily engineered features, evaluated by multiple validation metrics. The usefulness and applicability of FeatureEnVi are demonstrated with two use cases and a case study. We also report feedback from interviews with two ML experts and a visualization researcher who assessed the effectiveness of our system.
    Meta Two-Sample Testing: Learning Kernels for Testing with Limited Data. (arXiv:2106.07636v2 [stat.ML] UPDATED)
    (2 min) Modern kernel-based two-sample tests have shown great success in distinguishing complex, high-dimensional distributions with appropriate learned kernels. Previous work has demonstrated that this kernel learning procedure succeeds, assuming a considerable number of observed samples from each distribution. In realistic scenarios with very limited numbers of data samples, however, it can be challenging to identify a kernel powerful enough to distinguish complex distributions. We address this issue by introducing the problem of meta two-sample testing (M2ST), which aims to exploit (abundant) auxiliary data on related tasks to find an algorithm that can quickly identify a powerful test on new target tasks. We propose two specific algorithms for this task: a generic scheme which improves over baselines and a more tailored approach which performs even better. We provide both theoretical justification and empirical evidence that our proposed meta-testing schemes out-perform learning kernel-based tests directly from scarce observations, and identify when such schemes will be successful.
    Online Adaptation to Label Distribution Shift. (arXiv:2107.04520v2 [cs.LG] UPDATED)
    (2 min) Machine learning models often encounter distribution shifts when deployed in the real world. In this paper, we focus on adaptation to label distribution shift in the online setting, where the test-time label distribution is continually changing and the model must dynamically adapt to it without observing the true label. Leveraging a novel analysis, we show that the lack of true label does not hinder estimation of the expected test loss, which enables the reduction of online label shift adaptation to conventional online learning. Informed by this observation, we propose adaptation algorithms inspired by classical online learning techniques such as Follow The Leader (FTL) and Online Gradient Descent (OGD) and derive their regret bounds. We empirically verify our findings under both simulated and real world label distribution shifts and show that OGD is particularly effective and robust to a variety of challenging label shift scenarios.
    Algorithmic Stability and Generalization of an Unsupervised Feature Selection Algorithm. (arXiv:2010.09416v2 [cs.LG] UPDATED)
    (2 min) Feature selection, as a vital dimension reduction technique, reduces data dimension by identifying an essential subset of input features, which can facilitate interpretable insights into learning and inference processes. Algorithmic stability is a key characteristic of an algorithm regarding its sensitivity to perturbations of input samples. In this paper, we propose an innovative unsupervised feature selection algorithm attaining this stability with provable guarantees. The architecture of our algorithm consists of a feature scorer and a feature selector. The scorer trains a neural network (NN) to globally score all the features, and the selector adopts a dependent sub-NN to locally evaluate the representation abilities for selecting features. Further, we present algorithmic stability analysis and show that our algorithm has a performance guarantee via a generalization error bound. Extensive experimental results on real-world datasets demonstrate superior generalization performance of our proposed algorithm to strong baseline methods. Also, the properties revealed by our theoretical analysis and the stability of our algorithm-selected features are empirically confirmed.
    Online certification of preference-based fairness for personalized recommender systems. (arXiv:2104.14527v4 [cs.LG] UPDATED)
    (2 min) Recommender systems are facing scrutiny because of their growing impact on the opportunities we have access to. Current audits for fairness are limited to coarse-grained parity assessments at the level of sensitive groups. We propose to audit for envy-freeness, a more granular criterion aligned with individual preferences: every user should prefer their recommendations to those of other users. Since auditing for envy requires to estimate the preferences of users beyond their existing recommendations, we cast the audit as a new pure exploration problem in multi-armed bandits. We propose a sample-efficient algorithm with theoretical guarantees that it does not deteriorate user experience. We also study the trade-offs achieved on real-world recommendation datasets.
    The OARF Benchmark Suite: Characterization and Implications for Federated Learning Systems. (arXiv:2006.07856v3 [cs.LG] UPDATED)
    (2 min) This paper presents and characterizes an Open Application Repository for Federated Learning (OARF), a benchmark suite for federated machine learning systems. Previously available benchmarks for federated learning have focused mainly on synthetic datasets and use a limited number of applications. OARF mimics more realistic application scenarios with publicly available data sets as different data silos in image, text and structured data. Our characterization shows that the benchmark suite is diverse in data size, distribution, feature distribution and learning task complexity. The extensive evaluations with reference implementations show the future research opportunities for important aspects of federated learning systems. We have developed reference implementations, and evaluated the important aspects of federated learning, including model accuracy, communication cost, throughput and convergence time. Through these evaluations, we discovered some interesting findings such as federated learning can effectively increase end-to-end throughput.
    Stein Latent Optimization for Generative Adversarial Networks. (arXiv:2106.05319v4 [cs.LG] UPDATED)
    (2 min) Generative adversarial networks (GANs) with clustered latent spaces can perform conditional generation in a completely unsupervised manner. In the real world, the salient attributes of unlabeled data can be imbalanced. However, most of existing unsupervised conditional GANs cannot cluster attributes of these data in their latent spaces properly because they assume uniform distributions of the attributes. To address this problem, we theoretically derive Stein latent optimization that provides reparameterizable gradient estimations of the latent distribution parameters assuming a Gaussian mixture prior in a continuous latent space. Structurally, we introduce an encoder network and novel unsupervised conditional contrastive loss to ensure that data generated from a single mixture component represent a single attribute. We confirm that the proposed method, named Stein Latent Optimization for GANs (SLOGAN), successfully learns balanced or imbalanced attributes and achieves state-of-the-art unsupervised conditional generation performance even in the absence of attribute information (e.g., the imbalance ratio). Moreover, we demonstrate that the attributes to be learned can be manipulated using a small amount of probe data.
    Matrix Completion with Hierarchical Graph Side Information. (arXiv:2201.01728v1 [stat.ML])
    (2 min) We consider a matrix completion problem that exploits social or item similarity graphs as side information. We develop a universal, parameter-free, and computationally efficient algorithm that starts with hierarchical graph clustering and then iteratively refines estimates both on graph clustering and matrix ratings. Under a hierarchical stochastic block model that well respects practically-relevant social graphs and a low-rank rating matrix model (to be detailed), we demonstrate that our algorithm achieves the information-theoretic limit on the number of observed matrix entries (i.e., optimal sample complexity) that is derived by maximum likelihood estimation together with a lower-bound impossibility result. One consequence of this result is that exploiting the hierarchical structure of social graphs yields a substantial gain in sample complexity relative to the one that simply identifies different groups without resorting to the relational structure across them. We conduct extensive experiments both on synthetic and real-world datasets to corroborate our theoretical results as well as to demonstrate significant performance improvements over other matrix completion algorithms that leverage graph side information.
    Optimal Transport using GANs for Lineage Tracing. (arXiv:2007.12098v3 [cs.LG] UPDATED)
    (2 min) In this paper, we present Super-OT, a novel approach to computational lineage tracing that combines a supervised learning framework with optimal transport based on Generative Adversarial Networks (GANs). Unlike previous approaches to lineage tracing, Super-OT has the flexibility to integrate paired data. We benchmark Super-OT based on single-cell RNA-seq data against Waddington-OT, a popular approach for lineage tracing that also employs optimal transport. We show that Super-OT achieves gains over Waddington-OT in predicting the class outcome of cells during differentiation, since it allows the integration of additional information during training.
    Asymptotics of $\ell_2$ Regularized Network Embeddings. (arXiv:2201.01689v1 [stat.ML])
    (2 min) A common approach to solving tasks, such as node classification or link prediction, on a large network begins by learning a Euclidean embedding of the nodes of the network, from which regular machine learning methods can be applied. For unsupervised random walk methods such as DeepWalk and node2vec, adding a $\ell_2$ penalty on the embedding vectors to the loss leads to improved downstream task performance. In this paper we study the effects of this regularization and prove that, under exchangeability assumptions on the graph, it asymptotically leads to learning a nuclear-norm-type penalized graphon. In particular, the exact form of the penalty depends on the choice of subsampling method used within stochastic gradient descent to learn the embeddings. We also illustrate empirically that concatenating node covariates to $\ell_2$ regularized node2vec embeddings leads to comparable, if not superior, performance to methods which incorporate node covariates and the network structure in a non-linear manner.
    Partial Separability and Functional Graphical Models for Multivariate Gaussian Processes. (arXiv:1910.03134v4 [stat.ME] UPDATED)
    (2 min) The covariance structure of multivariate functional data can be highly complex, especially if the multivariate dimension is large, making extensions of statistical methods for standard multivariate data to the functional data setting challenging. For example, Gaussian graphical models have recently been extended to the setting of multivariate functional data by applying multivariate methods to the coefficients of truncated basis expansions. However, a key difficulty compared to multivariate data is that the covariance operator is compact, and thus not invertible. The methodology in this paper addresses the general problem of covariance modeling for multivariate functional data, and functional Gaussian graphical models in particular. As a first step, a new notion of separability for the covariance operator of multivariate functional data is proposed, termed partial separability, leading to a novel Karhunen-Lo\`eve-type expansion for such data. Next, the partial separability structure is shown to be particularly useful in order to provide a well-defined functional Gaussian graphical model that can be identified with a sequence of finite-dimensional graphical models, each of identical fixed dimension. This motivates a simple and efficient estimation procedure through application of the joint graphical lasso. Empirical performance of the method for graphical model estimation is assessed through simulation and analysis of functional brain connectivity during a motor task. %Empirical performance of the method for graphical model estimation is assessed through simulation and analysis of functional brain connectivity during a motor task.
    Bridging Adversarial and Nonstationary Multi-armed Bandit. (arXiv:2201.01628v1 [cs.LG])
    (2 min) In the multi-armed bandit framework, there are two formulations that are commonly employed to handle time-varying reward distributions: adversarial bandit and nonstationary bandit. Although their oracles, algorithms, and regret analysis differ significantly, we provide a unified formulation in this paper that smoothly bridges the two as special cases. The formulation uses an oracle that takes the best-fixed arm within time windows. Depending on the window size, it turns into the oracle in hindsight in the adversarial bandit and dynamic oracle in the nonstationary bandit. We provide algorithms that attain the optimal regret with the matching lower bound.
    Boosting Algorithms for Delivery Time Prediction in Transportation Logistics. (arXiv:2009.11598v2 [cs.LG] UPDATED)
    (2 min) Travel time is a crucial measure in transportation. Accurate travel time prediction is also fundamental for operation and advanced information systems. A variety of solutions exist for short-term travel time predictions such as solutions that utilize real-time GPS data and optimization methods to track the path of a vehicle. However, reliable long-term predictions remain challenging. We show in this paper the applicability and usefulness of travel time i.e. delivery time prediction for postal services. We investigate several methods such as linear regression models and tree based ensembles such as random forest, bagging, and boosting, that allow to predict delivery time by conducting extensive experiments and considering many usability scenarios. Results reveal that travel time prediction can help mitigate high delays in postal services. We show that some boosting algorithms, such as light gradient boosting and catboost, have a higher performance in terms of accuracy and runtime efficiency than other baselines such as linear regression models, bagging regressor and random forest.
    Adversarial Feature Desensitization. (arXiv:2006.04621v3 [cs.LG] UPDATED)
    (2 min) Neural networks are known to be vulnerable to adversarial attacks -- slight but carefully constructed perturbations of the inputs which can drastically impair the network's performance. Many defense methods have been proposed for improving robustness of deep networks by training them on adversarially perturbed inputs. However, these models often remain vulnerable to new types of attacks not seen during training, and even to slightly stronger versions of previously seen attacks. In this work, we propose a novel approach to adversarial robustness, which builds upon the insights from the domain adaptation field. Our method, called Adversarial Feature Desensitization (AFD), aims at learning features that are invariant towards adversarial perturbations of the inputs. This is achieved through a game where we learn features that are both predictive and robust (insensitive to adversarial attacks), i.e. cannot be used to discriminate between natural and adversarial data. Empirical results on several benchmarks demonstrate the effectiveness of the proposed approach against a wide range of attack types and attack strengths. Our code is available at https://github.com/BashivanLab/afd.
    Regret Lower Bounds for Learning Linear Quadratic Gaussian Systems. (arXiv:2201.01680v1 [cs.LG])
    (2 min) This paper presents local minimax regret lower bounds for adaptively controlling linear-quadratic-Gaussian (LQG) systems. We consider smoothly parametrized instances and provide an understanding of when logarithmic regret is impossible which is both instance specific and flexible enough to take problem structure into account. This understanding relies on two key notions: That of local-uninformativeness; when the optimal policy does not provide sufficient excitation for identification of the optimal policy, and yields a degenerate Fisher information matrix; and that of information-regret-boundedness, when the small eigenvalues of a policy-dependent information matrix are boundable in terms of the regret of that policy. Combined with a reduction to Bayesian estimation and application of Van Trees' inequality, these two conditions are sufficient for proving regret bounds on order of magnitude $\sqrt{T}$ in the time horizon, $T$. This method yields lower bounds that exhibit tight dimensional dependencies and scale naturally with control-theoretic problem constants. For instance, we are able to prove that systems operating near marginal stability are fundamentally hard to learn to control. We further show that large classes of systems satisfy these conditions, among them any state-feedback system with both $A$- and $B$-matrices unknown. Most importantly, we also establish that a nontrivial class of partially observable systems, essentially those that are over-actuated, satisfy these conditions, thus providing a $\sqrt{T}$ lower bound also valid for partially observable systems. Finally, we turn to two simple examples which demonstrate that our lower bound captures classical control-theoretic intuition: our lower bounds diverge for systems operating near marginal stability or with large filter gain -- these can be arbitrarily hard to (learn to) control.
    Understanding Entropy Coding With Asymmetric Numeral Systems (ANS): a Statistician's Perspective. (arXiv:2201.01741v1 [stat.ML])
    (2 min) Entropy coding is the backbone data compression. Novel machine-learning based compression methods often use a new entropy coder called Asymmetric Numeral Systems (ANS) [Duda et al., 2015], which provides very close to optimal bitrates and simplifies [Townsend et al., 2019] advanced compression techniques such as bits-back coding. However, researchers with a background in machine learning often struggle to understand how ANS works, which prevents them from exploiting its full versatility. This paper is meant as an educational resource to make ANS more approachable by presenting it from a new perspective of latent variable models and the so-called bits-back trick. We guide the reader step by step to a complete implementation of ANS in the Python programming language, which we then generalize for more advanced use cases. We also present and empirically evaluate an open-source library of various entropy coders designed for both research and production use. Related teaching videos and problem sets are available online.
    Inverse Extended Kalman Filter. (arXiv:2201.01539v1 [math.OC])
    (2 min) Recent advances in counter-adversarial systems have garnered significant research interest in inverse filtering from a Bayesian perspective. For example, interest in estimating the adversary's Kalman filter tracked estimate with the purpose of predicting the adversary's future steps has led to recent formulations of inverse Kalman filter (I-KF). In this context of inverse filtering, we address the key challenges of nonlinear process dynamics and unknown input to the forward filter by proposing inverse extended Kalman filter (I-EKF). We derive I-EKF with and without an unknown input by considering nonlinearity in both forward and inverse state-space models. In the process, I-KF-with-unknown-input is also obtained. We then provide theoretical stability guarantees using both bounded nonlinearity and unknown matrix approaches. We further generalize these formulations and results to the case of higher-order, Gaussian-sum, and dithered I-EKFs. Numerical experiments validate our methods for various proposed inverse filters using the recursive Cram\'er-Rao lower bound as a benchmark.
    Convergence and Complexity of Stochastic Block Majorization-Minimization. (arXiv:2201.01652v1 [math.OC])
    (2 min) Stochastic majorization-minimization (SMM) is an online extension of the classical principle of majorization-minimization, which consists of sampling i.i.d. data points from a fixed data distribution and minimizing a recursively defined majorizing surrogate of an objective function. In this paper, we introduce stochastic block majorization-minimization, where the surrogates can now be only block multi-convex and a single block is optimized at a time within a diminishing radius. Relaxing the standard strong convexity requirements for surrogates in SMM, our framework gives wider applicability including online CANDECOMP/PARAFAC (CP) dictionary learning and yields greater computational efficiency especially when the problem dimension is large. We provide an extensive convergence analysis on the proposed algorithm, which we derive under possibly dependent data streams, relaxing the standard i.i.d. assumption on data samples. We show that the proposed algorithm converges almost surely to the set of stationary points of a nonconvex objective under constraints at a rate $O((\log n)^{1+\eps}/n^{1/2})$ for the empirical loss function and $O((\log n)^{1+\eps}/n^{1/4})$ for the expected loss function, where $n$ denotes the number of data samples processed. Under some additional assumption, the latter convergence rate can be improved to $O((\log n)^{1+\eps}/n^{1/2})$. Our results provide first convergence rate bounds for various online matrix and tensor decomposition algorithms under a general Markovian data setting.
  • The Official NVIDIA Blog

    Scooping up Customers: Startup’s No-Code AI Gains Traction for Industrial Inspection
    (3 min) Bill Kish founded Ruckus Wireless two decades ago to make Wi-Fi networking easier. Now, he’s doing the same for computer vision in industrial AI. In 2015, Kish started Cogniac, a company that offers a self-service computer vision platform and development support. Like in the early days of Wi-Fi deployment, the rollout of AI is challenging, Read article > The post Scooping up Customers: Startup’s No-Code AI Gains Traction for Industrial Inspection appeared first on The Official NVIDIA Blog.
  • AI Weirdness

    Remember those wooden playgrounds? AI doesn't.
    (3 min) I remember when my hometown got one of these giant wooden playgrounds. It must have been in the early 90s and a kid could get lost for hours in there. Or injured or full of splinters or chromated copper arsenate I guess, which is why you don't see
    Bonus: these playgrounds might not be exactly safe
    (1 min) AI Weirdness: the strange side of machine learning
  • AWS Machine Learning Blog

    Blur faces in videos automatically with Amazon Rekognition Video
    (8 min) With the advent of artificial intelligence (AI) and machine learning (ML), customers and the general public have become increasingly aware of their privacy, as well as the value that it holds in today’s data-driven world. Enterprises are actively seeking out and marketing privacy-first solutions, especially in the Computer Vision (CV) domain. They need to reassure […]
  • Artificial Intelligence

    Does anyone know what ai software may have been used to make this? 👌🌸🥰
    (0 min) submitted by /u/6owline1vex [link] [comments]
    This Is The Reason Why Bruce Willis Is Licensing Deepfake Image Rights
    (0 min) submitted by /u/sopadebombillas [link] [comments]
    Record Chess Moves with Webcam for Online Play
    (0 min) Hello! I would like to share with you my project which records moves made on your real life chess board using webcam. It also transmits the moves to any chess website for playing online. https://github.com/karayaman/Play-online-chess-with-real-chess-board submitted by /u/Savings_Ad904 [link] [comments]
    [R] Baidu’s 10-Billion Scale ERNIE-ViLG Unified Generative Pretraining Framework Achieves SOTA Performance on Bidirectional Vision-Language Generation Tasks
    (1 min) Baidu researchers propose ERNIE-ViLG, a 10-billion parameter scale pretraining framework for bidirectional text-image generation. Pretrained on 145 million (Chinese) image-text pairs, ERNIE-ViLG achieves state-of-the-art performance on both text-to-image and image-to-text generation tasks. Here is a quick read: Baidu’s 10-Billion Scale ERNIE-ViLG Unified Generative Pretraining Framework Achieves SOTA Performance on Bidirectional Vision-Language Generation Tasks. The paper ERNIE-ViLG: Unified Generative Pre-training for Bidirectional Vision-Language Generation is on arXiv. submitted by /u/Yuqing7 [link] [comments]
    Development & Application Of Responsible AI For Allied Defense & Security - Dr. Nikos Loutas Ph.D., Head of Data & ArtificiaI Intelligence (AI) Policy, NATO
    (1 min) submitted by /u/ObjectiveGround5 [link] [comments]
    What are the biggest AI events in Europe?
    (1 min) I want to start attending AI trade shows and conferences in Europe, but there seem to be many and I don't know where to start. I'm particularly interested in applying to be a speaker so I can share my thesis insights with peers. Thanks in advance! submitted by /u/kerfufflewhoople [link] [comments]
    Upscale images
    (1 min) Hey what is the best way to upscale lots of 500x500x png images? Any website/program/colab recommendations? Thanks for the help submitted by /u/BananaDrum [link] [comments]
    I hope you find this video entertaining or informative! Artificial Intelligence: Life is Weird
    (0 min) submitted by /u/TonyTalksBackPodcast [link] [comments]
  • Machine Learning

    Neural Architecture Search (NAS) [D]
    (2 min) The goal of NAS is to find an optimal architecture of a neural network for a given problem. In my case, I am interested in creating an algorithm for finding the architecture of a Convolutional Neural Network for an Image dataset. The problem with any NAS algorithm is that the only true way to know if a created architecture is good is by running the model to convergence, thus increasing computational resources spent. There have been many possible solutions to gauge the relative trained accuracy of model. The most popular is to simply reduce the dataset, reuse training weights, and/or train for only a couple of iterations. I was wanting to hear some thoughts... I can save computational resources by only training each possible model for only 3 iterations and gauge its success using its training accuracy. The problem however with using training accuracy is that it does not gauge how well the model will generalize, only if it is capable of learning anything from the training data, if not overfitting. The problem with using the validation accuracy after 3 iterations is that most models have extremely poor validation accuracy scores early on during training, only catching up around iteration 10, therefore in order to fully gauged the validation accuracy the models would have to be trained until iteration 10, 3x more than using the training accuracy after 3 iterations. Would you rather train the models for only 3 iterations, using training accuracy as the metric to compare models, risking the models overfitting when trained until convergence; or, would you rather play the safe game and train models for 10 iterations using validation accuracy as a metric to compare models, but spend 3x the amount of computational resources? submitted by /u/MyActualUserName99 [link] [comments]
    [R] Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets
    (0 min) submitted by /u/hardmaru [link] [comments]
    [D] Is model calibration important for industry practitioners?
    (2 min) There has been a lot of recent work on model calibration — ensuring a match between a model’s confidence and correctness. This seems critical in applications where confidence/safety matters (e.g. perception related parts of self-driving), but I’m curious whether model calibration is something important for industry practitioners. submitted by /u/Zestyclose-Orange468 [link] [comments]
    [P] ML Contests: Competitive ML Community Website + Discord
    (1 min) I posted here a while ago about a website I created that aggregates competitions across Kaggle and other websites: https://mlcontests.com (open source and community maintained - GitHub link at the bottom of the page) https://preview.redd.it/7poem4aqmaa81.png?width=1100&format=png&auto=webp&s=362666ed2bec33d88cdb2c03b84371891e29a928 Based on some feedback I implemented a few extra features, and today I launched a Discord community to go along with it - this allows people to find teammates to collaborate with on a competition, and to get real-time alerts when new competitions are launched. There are also a few team members from the competition platforms in the Discord already. https://preview.redd.it/5a0tzdjomaa81.png?width=673&format=png&auto=webp&s=c78df3326d6a8574eba2e1cd94245d6563f29df3 Also welcome is discussion about ongoing competitions, and post-mortems/debriefs after competitions end. If this sounds interesting to you, you can join the Discord here: https://discord.com/invite/nM482gc9h8 Hope to see you there! submitted by /u/hcarlens [link] [comments]
    [D] MC Dropout is not Bayesian, so why are so many papers still using it for that?
    (3 min) In the domain of uncertainty estimation in places like segmentation, MC Dropout is used as an approximation of Bayesian computations. I've seen a thread from a few years ago talking about its issues and I've just read this recent paper called Is MC Dropout Bayesian?, and they conclude it isn't. They propose a new method for parametric VI based on the reparametrization trick and stochastic backpropagation, but my maths isn't good enough to fully understand it. Does anyone have any thoughts on this? I find it a little strange that every uncertainty estimation paper in segmentation just uses MC Dropout as a default, with just a throwaway line that it is "approximately Bayesian", even though it isn't! submitted by /u/shellyturnwarm [link] [comments]
    How to Implement an Efficient LayerNorm CUDA Kernel[R]
    (1 min) Article:https://oneflow2020.medium.com/how-to-implement-an-efficient-layernorm-cuda-kernel-oneflow-performance-optimization-731e91a285b8 Code:https://github.com/Oneflow-Inc/oneflow/ In a previous article, we discussed OneFlow’s techniques for optimizing the Softmax CUDA Kernel. The performance of the OneFlow-optimized Softmax greatly exceeds that of the Softmax of CuDNN, and OneFlow also fully optimizes half types that many frameworks do not take into account. Here, we share OneFlow’s approach for optimizing the performance of another important operator, LayerNorm. submitted by /u/Just0by [link] [comments]
    [D] Neuro-Symbolic AI
    (1 min) Hello, I recently came across neuro-symbolic ai, and the concept looked really cool. However, except for a 2-3 papers from IBM-MIT research and a couple of blogs, I couldn’t find any work in this direction. Apart from VQA, what kind of potential domains can NS be applied to? Are there any caveats I am not aware of as to why not many from the community are not working in this direction? Thank you! submitted by /u/NightlessBaron [link] [comments]
    [D] GAN training initially degrades results of pre-trained generator
    (3 min) Hello, I have an issue with the training of a GAN, which consists of a generator and two discriminators, each discriminator working on one specific domain (i.e. time domain and time-frequency domain). The generator is used to generate waveforms, provided other waveforms (i.e. noisy waveform to de-noised waveform). ​ The whole GAN is trained in the following way: 1-The generator is independently pre-trained by regression, until validation losses reach a plateau. 2-The two randomly initialized discriminators are then activated, and GANs training takes place. ​ The output of step 1. is a raw version of the desired output. However, step 2 inititially degrades the output of the generator, by removing too much information. This is also reflected in all the validation losses, whose values…
    [D]DATACAMP & Data Leakage Practices. Does Datacamp need a quality audit?
    (2 min) I must be going crazy. Anyone take Datacamp courses and do projects? So quite a lot of projects and courses actually apply data preprocessing techniques before splitting data that encourage data leakage. I can't believe I paid for their services...... Anyone catch other bad habits that Datacamp encourages? Aside from being a rather ineffective platform to learn? submitted by /u/THE_REAL_ODB [link] [comments]
    [D] How to choose the appropriate network: Pretrained Image Recognition Models
    (2 min) Hi! I am currently involved in a project that needs me to classify multimodal data - In which one modality is image. I went through many papers on the topic and most of them just seem to choose a random pretrained architecture for the same - VGG19 being the most common. I found little to no reason for the same except for the performance being better than other models. So basically, I want to know whether there is a way to choose the right model? And if not, which one out of Inception, VGG and ResNet are the best for binary image classification on a large scale? Thanks! submitted by /u/prabhav55221 [link] [comments]
  • Neural Networks, Deep Learning and Machine Learning

    Seq2Seq learning resources
    (0 min) I am interested in learning Seq2Seq family of ML/DL. Can you provide free learning resources to cover topics like RNN, LSTM, GRU, Transformer? Thanks submitted by /u/grid_world [link] [comments]
    Responsive ai dialogue for gemes with NN?
    (1 min) Hello, Currently I am learning NN algos, because I want to create a project, but during this time I got an idea for another project, I'm curious if its possible. Some of us have their favourite games of all times as do I, i.e. skyrim. And I got the second idea for the project. What if I created some kind of a "chat bot" that runs in the background in real time. It would allow player, to communicate with npcs with their own voice and ask them something that is not scripted in the game itself, but is "lore friendly", by this I mean no questions about trump, biden or tiktok lol. Of course the second step would be to synthesize the voices of npcs for their generated dialogue. What I'd like to know is if that is possible, not only from proggramimg perspecitve(cuz anything given enough time would be completed), but is it possible for modern computers to run? It is just a crazy idea of a begginer lol submitted by /u/skollehatti [link] [comments]
    Are there any good books or videos for beginners?
    (1 min) I’m only 15 and I don’t know much about how neural networks actually function, I have been very interested in this subject for a while and I wanted to learn more. Any recommendations? submitted by /u/Forever061 [link] [comments]
  • Reinforcement Learning

    Reinforcement learning for the real world
    (0 min) submitted by /u/bendee983 [link] [comments]
    Can't find where the agent is in the code from this paper (I am a newbie)
    (2 min) Hi all, I am trying to look at the code from a paper (https://github.com/anirudh9119/shared_workspace/tree/main/Triangle). I can code fluently and I know RL quite well in theory, the problem is I don't have much hands-on experience yet. Perhaps this is why I get confused when looking at this repo, not because I don't understand how the algorithm works, but because I can't find the code for it. Specifically, I would like to take the code for this algorithm: https://preview.redd.it/vip5wrq3p9a81.png?width=957&format=png&auto=webp&s=4735929cbbd651d1d26dcbdfcb33b2a99be28c16 Naively, I would expect to see one file with all the classes and functions needed for this agent, and then another file with the agent imported and the reinforcement learning task performed in a certain environment. Instead, in this repo there are so many files with so many classes and so many functions, that I get lost and I can't identify which code refers to this algorithm. I can't see where the agent is, I can't see where the task is performed and so on. I don't understand if the repo is actually messy, or if it's just due to the fact that I'm not interpreting it in the right way. Thanks! submitted by /u/No_Possibility_7588 [link] [comments]
    What is the current SOTA for Offline RL?
    (1 min) Hi everyone! I'm mostly interested in Offline RL approaches for environments with distribution shift. I'm reading Decision Transformer: Reinforcement Learning via Sequence Modeling (https://arxiv.org/abs/2106.01345) paper, and was wondering what would be the benchmark / SOTA right now? submitted by /u/fusionquant [link] [comments]
    Grab your digital copy of The TensorFlow Workshop
    (1 min) Packt has Published "The TensorFlow Workshop " ​ Grab your digital copy now if you feel you are interested. ​ As part of our marketing activities, we are offering free digital copies of the book in return for unbiased feedback in the form of a reader review. ​ Get started with TensorFlow fundamentals to build and train deep learning models with real-world data, practical exercises, and challenging activities. ​ Here is what you will learn from the book: ​ Get to grips with TensorFlow’s mathematical operations Pre-process a wide variety of tabular, sequential, and image data Understand the purpose and usage of different deep learning layers Perform hyperparameter-tuning to prevent overfitting of training data Use pre-trained models to speed up the development of learning models Generate new data based on existing patterns using generative models ​ ## Key Features ​ * Understand the fundamentals of tensors, neural networks, and deep learning * Discover how to implement and fine-tune deep learning models for real-world datasets * Build your experience and confidence with hands-on exercises and activities ​ Please comment below or DM me for more details submitted by /u/RoyluisRodrigues [link] [comments]
    How to draw the path of the end effector in mujoco_py
    (1 min) I’m working on a project where my robotic arm should trace a hand written letter. For me to show that my arm is indeed tracing out the letter it would be better if I can draw a line as the end effector moves. Does anyone know how can I do that? submitted by /u/the_loner_98 [link] [comments]
    "Hidden Agenda: a Social Deduction Game with Diverse Learned Equilibria", Kopparapu et al 2022 {DM} (sus)
    (1 min) submitted by /u/gwern [link] [comments]
    Defining actor/critic networks under one class
    (1 min) Hello everyone, I am currently using a PPO implementation (using pytorch) that has the actor and critic networks defined under one class (Class ActorCritic(nn.Module)), where each network is defined using nn.sequential (self.actor = nn.sequential ... and self.critic=nn.sequential). Is there a way to define both networks in the same class but using the forward method instead of nn.sequential? submitted by /u/AhmedNizam_ [link] [comments]

2022-01-06

  • cs.LG updates on arXiv.org

    Hierarchical Learning to Solve Partial Differential Equations Using Physics-Informed Neural Networks. (arXiv:2112.01254v2 [cs.LG] UPDATED)
    (2 min) The neural network-based approach to solving partial differential equations has attracted considerable attention due to its simplicity and flexibility in representing the solution of the partial differential equation. In training a neural network, the network learns global features corresponding to low-frequency components while high-frequency components are approximated at a much slower rate. For a class of equations in which the solution contains a wide range of scales, the network training process can suffer from slow convergence and low accuracy due to its inability to capture the high-frequency components. In this work, we propose a hierarchical approach to improve the convergence rate and accuracy of the neural network solution to partial differential equations. The proposed method comprises multi-training levels in which a newly introduced neural network is guided to learn the residual of the previous level approximation. By the nature of neural networks' training process, the high-level correction is inclined to capture the high-frequency components. We validate the efficiency and robustness of the proposed hierarchical approach through a suite of linear and nonlinear partial differential equations.
    Towards Deeper Deep Reinforcement Learning with Spectral Normalization. (arXiv:2106.01151v4 [cs.LG] UPDATED)
    (2 min) In computer vision and natural language processing, innovations in model architecture that increase model capacity have reliably translated into gains in performance. In stark contrast with this trend, state-of-the-art reinforcement learning (RL) algorithms often use small MLPs, and gains in performance typically originate from algorithmic innovations. It is natural to hypothesize that small datasets in RL necessitate simple models to avoid overfitting; however, this hypothesis is untested. In this paper we investigate how RL agents are affected by exchanging the small MLPs with larger modern networks with skip connections and normalization, focusing specifically on actor-critic algorithms. We empirically verify that naively adopting such architectures leads to instabilities and poor performance, likely contributing to the popularity of simple models in practice. However, we show that dataset size is not the limiting factor, and instead argue that instability from taking gradients through the critic is the culprit. We demonstrate that spectral normalization (SN) can mitigate this issue and enable stable training with large modern architectures. After smoothing with SN, larger models yield significant performance improvements -- suggesting that more "easy" gains may be had by focusing on model architectures in addition to algorithmic innovations.
    Adaptive and Multiple Time-scale Eligibility Traces for Online Deep Reinforcement Learning. (arXiv:2008.10040v2 [cs.RO] UPDATED)
    (2 min) Deep reinforcement learning (DRL) is one promising approach to teaching robots to perform complex tasks. Because methods that directly reuse the stored experience data cannot follow the change of the environment in robotic problems with a time-varying environment, online DRL is required. The eligibility traces method is well known as an online learning technique for improving sample efficiency in traditional reinforcement learning with linear regressors rather than DRL. The dependency between parameters of deep neural networks would destroy the eligibility traces, which is why they are not integrated with DRL. Although replacing the gradient with the most influential one rather than accumulating the gradients as the eligibility traces can alleviate this problem, the replacing operation reduces the number of reuses of previous experiences. To address these issues, this study proposes a new eligibility traces method that can be used even in DRL while maintaining high sample efficiency. When the accumulated gradients differ from those computed using the latest parameters, the proposed method takes into account the divergence between the past and latest parameters to adaptively decay the eligibility traces. Bregman divergences between outputs computed by the past and latest parameters are exploited due to the infeasible computational cost of the divergence between the past and latest parameters. In addition, a generalized method with multiple time-scale traces is designed for the first time. This design allows for the replacement of the most influential adaptively accumulated (decayed) eligibility traces.
    Dynamic Object Removal and Spatio-Temporal RGB-D Inpainting via Geometry-Aware Adversarial Learning. (arXiv:2008.05058v4 [cs.CV] UPDATED)
    (2 min) Dynamic objects have a significant impact on the robot's perception of the environment which degrades the performance of essential tasks such as localization and mapping. In this work, we address this problem by synthesizing plausible color, texture and geometry in regions occluded by dynamic objects. We propose the novel geometry-aware DynaFill architecture that follows a coarse-to-fine topology and incorporates our gated recurrent feedback mechanism to adaptively fuse information from previous timesteps. We optimize our architecture using adversarial training to synthesize fine realistic textures which enables it to hallucinate color and depth structure in occluded regions online in a spatially and temporally coherent manner, without relying on future frame information. Casting our inpainting problem as an image-to-image translation task, our model also corrects regions correlated with the presence of dynamic objects in the scene, such as shadows or reflections. We introduce a large-scale hyperrealistic dataset with RGB-D images, semantic segmentation labels, camera poses as well as groundtruth RGB-D information of occluded regions. Extensive quantitative and qualitative evaluations show that our approach achieves state-of-the-art performance, even in challenging weather conditions. Furthermore, we present results for retrieval-based visual localization with the synthesized images that demonstrate the utility of our approach.
    Interpretable Low-Resource Legal Decision Making. (arXiv:2201.01164v1 [cs.LG])
    (2 min) Over the past several years, legal applications of deep learning have been on the rise. However, as with other high-stakes decision making areas, the requirement for interpretability is of crucial importance. Current models utilized by legal practitioners are more of the conventional machine learning type, wherein they are inherently interpretable, yet unable to harness the performance capabilities of data-driven deep learning models. In this work, we utilize deep learning models in the area of trademark law to shed light on the issue of likelihood of confusion between trademarks. Specifically, we introduce a model-agnostic interpretable intermediate layer, a technique which proves to be effective for legal documents. Furthermore, we utilize weakly supervised learning by means of a curriculum learning strategy, effectively demonstrating the improved performance of a deep learning model. This is in contrast to the conventional models which are only able to utilize the limited number of expensive manually-annotated samples by legal experts. Although the methods presented in this work tackles the task of risk of confusion for trademarks, it is straightforward to extend them to other fields of law, or more generally, to other similar high-stakes application scenarios.
    Trusting Machine Learning Results from Medical Procedures in the Operating Room. (arXiv:2201.01060v1 [cs.LG])
    (2 min) Machine learning can be used to analyse physiological data for several purposes. Detection of cerebral ischemia is an achievement that would have high impact on patient care. We attempted to study if collection of continous physiological data from non-invasive monitors, and analysis with machine learning could detect cerebral ischemia in tho different setting, during surgery for carotid endarterectomy and during endovascular thrombectomy in acute stroke. We compare the results from the two different group and one patient from each group in details. While results from CEA-patients are consistent, those from thrombectomy patients are not and frequently contain extreme values such as 1.0 in accuracy. We conlcude that this is a result of short duration of the procedure and abundance of data with bad quality resulting in small data sets. These results can therefore not be trusted.
    Neural Transfer Learning for Repairing Security Vulnerabilities in C Code. (arXiv:2104.08308v3 [cs.SE] UPDATED)
    (2 min) In this paper, we address the problem of automatic repair of software vulnerabilities with deep learning. The major problem with data-driven vulnerability repair is that the few existing datasets of known confirmed vulnerabilities consist of only a few thousand examples. However, training a deep learning model often requires hundreds of thousands of examples. In this work, we leverage the intuition that the bug fixing task and the vulnerability fixing task are related and that the knowledge learned from bug fixes can be transferred to fixing vulnerabilities. In the machine learning community, this technique is called transfer learning. In this paper, we propose an approach for repairing security vulnerabilities named VRepair which is based on transfer learning. VRepair is first trained on a large bug fix corpus and is then tuned on a vulnerability fix dataset, which is an order of magnitude smaller. In our experiments, we show that a model trained only on a bug fix corpus can already fix some vulnerabilities. Then, we demonstrate that transfer learning improves the ability to repair vulnerable C functions. We also show that the transfer learning model performs better than a model trained with a denoising task and fine-tuned on the vulnerability fixing task. To sum up, this paper shows that transfer learning works well for repairing security vulnerabilities in C compared to learning on a small dataset.
    Differentially Private Transferrable Deep Learning with Membership-Mappings. (arXiv:2105.04615v3 [cs.LG] UPDATED)
    (2 min) This paper considers the problem of differentially private semi-supervised transfer and multi-task learning. The notion of \emph{membership-mapping} has been developed using measure theory basis to learn data representation via a fuzzy membership function. An alternative conception of deep autoencoder, referred to as \emph{Conditionally Deep Membership-Mapping Autoencoder (CDMMA)}, is considered for transferrable deep learning. Under practice-oriented settings, an analytical solution for the learning of CDMMA can be derived by means of variational optimization. The paper proposes a transfer and multi-task learning approach that combines CDMMA with a tailored noise adding mechanism to achieve a given level of privacy-loss bound with the minimum perturbation of the data. Numerous experiments were carried out using MNIST, USPS, Office, and Caltech256 datasets to verify the competitive robust performance of the proposed methodology.
    On the Minimal Adversarial Perturbation for Deep Neural Networks with Provable Estimation Error. (arXiv:2201.01235v1 [cs.LG])
    (2 min) Although Deep Neural Networks (DNNs) have shown incredible performance in perceptive and control tasks, several trustworthy issues are still open. One of the most discussed topics is the existence of adversarial perturbations, which has opened an interesting research line on provable techniques capable of quantifying the robustness of a given input. In this regard, the Euclidean distance of the input from the classification boundary denotes a well-proved robustness assessment as the minimal affordable adversarial perturbation. Unfortunately, computing such a distance is highly complex due the non-convex nature of NNs. Despite several methods have been proposed to address this issue, to the best of our knowledge, no provable results have been presented to estimate and bound the error committed. This paper addresses this issue by proposing two lightweight strategies to find the minimal adversarial perturbation. Differently from the state-of-the-art, the proposed approach allows formulating an error estimation theory of the approximate distance with respect to the theoretical one. Finally, a substantial set of experiments is reported to evaluate the performance of the algorithms and support the theoretical findings. The obtained results show that the proposed strategies approximate the theoretical distance for samples close to the classification boundary, leading to provable robustness guarantees against any adversarial attacks.
    COVID-19 Disease Progression Prediction via Audio Signals: A Longitudinal Study. (arXiv:2201.01232v1 [cs.SD])
    (2 min) Recent work has shown the potential of the use of audio data in screening for COVID-19. However, very little exploration has been done of monitoring disease progression, especially recovery in COVID-19 through audio. Tracking disease progression characteristics and patterns of recovery could lead to tremendous insights and more timely treatment or treatment adjustment, as well as better resources management in health care systems. The primary objective of this study is to explore the potential of longitudinal audio dynamics for COVID-19 monitoring using sequential deep learning techniques, focusing on prediction of disease progression and, especially, recovery trend prediction. We analysed crowdsourced respiratory audio data from 212 individuals over 5 days to 385 days, alongside their self-reported COVID-19 test results. We first explore the benefits of capturing longitudinal dynamics of audio biomarkers for COVID-19 detection. The strong performance, yielding an AUC-ROC of 0.79, sensitivity of 0.75 and specificity of 0.70, supports the effectiveness of the approach compared to methods that do not leverage longitudinal dynamics. We further examine the predicted disease progression trajectory, which displays high consistency with the longitudinal test results with a correlation of 0.76 in the test cohort, and 0.86 in a subset of the test cohort with 12 participants who report disease recovery. Our findings suggest that monitoring COVID-19 progression via longitudinal audio data has enormous potential in the tracking of individuals' disease progression and recovery.
    Fast Approximation of the Sliced-Wasserstein Distance Using Concentration of Random Projections. (arXiv:2106.15427v2 [stat.ML] UPDATED)
    (2 min) The Sliced-Wasserstein distance (SW) is being increasingly used in machine learning applications as an alternative to the Wasserstein distance and offers significant computational and statistical benefits. Since it is defined as an expectation over random projections, SW is commonly approximated by Monte Carlo. We adopt a new perspective to approximate SW by making use of the concentration of measure phenomenon: under mild assumptions, one-dimensional projections of a high-dimensional random vector are approximately Gaussian. Based on this observation, we develop a simple deterministic approximation for SW. Our method does not require sampling a number of random projections, and is therefore both accurate and easy to use compared to the usual Monte Carlo approximation. We derive nonasymptotical guarantees for our approach, and show that the approximation error goes to zero as the dimension increases, under a weak dependence condition on the data distribution. We validate our theoretical findings on synthetic datasets, and illustrate the proposed approximation on a generative modeling problem.
    Adaptive Template Enhancement for Improved Person Recognition using Small Datasets. (arXiv:2201.01218v1 [eess.SP])
    (2 min) A novel instance-based method for the classification of electroencephalography (EEG) signals is presented and evaluated in this paper. The non-stationary nature of the EEG signals, coupled with the demanding task of pattern recognition with limited training data as well as the potentially noisy signal acquisition conditions, have motivated the work reported in this study. The proposed adaptive template enhancement mechanism transforms the feature-level instances by treating each feature dimension separately, hence resulting in improved class separation and better query-class matching. The proposed new instance-based learning algorithm is compared with a few related algorithms in a number of scenarios. A clinical grade 64-electrode EEG database, as well as a low-quality (high-noise level) EEG database obtained with a low-cost system using a single dry sensor have been used for evaluations in biometric person recognition. The proposed approach demonstrates significantly improved classification accuracy in both identification and verification scenarios. In particular, this new method is seen to provide a good classification performance for noisy EEG data, indicating its potential suitability for a wide range of applications.
    Survey on the Convergence of Machine Learning and Blockchain. (arXiv:2201.00976v1 [cs.LG])
    (2 min) Machine learning (ML) has been pervasively researched nowadays and it has been applied in many aspects of real life. Nevertheless, issues of model and data still accompany the development of ML. For instance, training of traditional ML models is limited to the access of data sets, which are generally proprietary; published ML models may soon be out of date without update of new data and continuous training; malicious data contributors may upload wrongly labeled data that leads to undesirable training results; and the abuse of private data and data leakage also exit. With the utilization of blockchain, an emerging and swiftly developing technology, these problems can be efficiently solved. In this paper, we conduct a survey of the convergence of collaborative ML and blockchain. We investigate different ways of combination of these two technologies, and their fields of application. We also discuss the limitations of current research and their future directions.
    FROTE: Feedback Rule-Driven Oversampling for Editing Models. (arXiv:2201.01070v1 [cs.LG])
    (2 min) Machine learning models may involve decision boundaries that change over time due to updates to rules and regulations, such as in loan approvals or claims management. However, in such scenarios, it may take time for sufficient training data to accumulate in order to retrain the model to reflect the new decision boundaries. While work has been done to reinforce existing decision boundaries, very little has been done to cover these scenarios where decision boundaries of the ML models should change in order to reflect new rules. In this paper, we focus on user-provided feedback rules as a way to expedite the ML models update process, and we formally introduce the problem of pre-processing training data to edit an ML model in response to feedback rules such that once the model is retrained on the pre-processed data, its decision boundaries align more closely with the rules. To solve this problem, we propose a novel data augmentation method, the Feedback Rule-Based Oversampling Technique. Extensive experiments using different ML models and real world datasets demonstrate the effectiveness of the method, in particular the benefit of augmentation and the ability to handle many feedback rules.
    DeepVisualInsight: Time-Travelling Visualization for Spatio-Temporal Causality of Deep Classification Training. (arXiv:2201.01155v1 [cs.LG])
    (2 min) Understanding how the predictions of deep learning models are formed during the training process is crucial to improve model performance and fix model defects, especially when we need to investigate nontrivial training strategies such as active learning, and track the root cause of unexpected training results such as performance degeneration. In this work, we propose a time-travelling visual solution DeepVisualInsight (DVI), aiming to manifest the spatio-temporal causality while training a deep learning image classifier. The spatio-temporal causality demonstrates how the gradient-descent algorithm and various training data sampling techniques can influence and reshape the layout of learnt input representation and the classification boundaries in consecutive epochs. Such causality allows us to observe and analyze the whole learning process in the visible low dimensional space. Technically, we propose four spatial and temporal properties and design our visualization solution to satisfy them. These properties preserve the most important information when inverse-)projecting input samples between the visible low-dimensional and the invisible high-dimensional space, for causal analyses. Our extensive experiments show that, comparing to baseline approaches, we achieve the best visualization performance regarding the spatial/temporal properties and visualization efficiency. Moreover, our case study shows that our visual solution can well reflect the characteristics of various training scenarios, showing good potential of DVI as a debugging tool for analyzing deep learning training processes.
    Discovering Diverse Nearly Optimal Policies with Successor Features. (arXiv:2106.00669v2 [cs.AI] UPDATED)
    (2 min) Finding different solutions to the same problem is a key aspect of intelligence associated with creativity and adaptation to novel situations. In reinforcement learning, a set of diverse policies can be useful for exploration, transfer, hierarchy, and robustness. We propose Diverse Successive Policies, a method for discovering policies that are diverse in the space of Successor Features, while assuring that they are near optimal. We formalize the problem as a Constrained Markov Decision Process (CMDP) where the goal is to find policies that maximize diversity, characterized by an intrinsic diversity reward, while remaining near-optimal with respect to the extrinsic reward of the MDP. We also analyze how recently proposed robustness and discrimination rewards perform and find that they are sensitive to the initialization of the procedure and may converge to sub-optimal solutions. To alleviate this, we propose new explicit diversity rewards that aim to minimize the correlation between the Successor Features of the policies in the set. We compare the different diversity mechanisms in the DeepMind Control Suite and find that the type of explicit diversity we are proposing is important to discover distinct behavior, like for example different locomotion patterns.
    Biased Hypothesis Formation From Projection Pursuit. (arXiv:2201.00889v1 [cs.LG])
    (2 min) The effect of bias on hypothesis formation is characterized for an automated data-driven projection pursuit neural network to extract and select features for binary classification of data streams. This intelligent exploratory process partitions a complete vector state space into disjoint subspaces to create working hypotheses quantified by similarities and differences observed between two groups of labeled data streams. Data streams are typically time sequenced, and may exhibit complex spatio-temporal patterns. For example, given atomic trajectories from molecular dynamics simulation, the machine's task is to quantify dynamical mechanisms that promote function by comparing protein mutants, some known to function while others are nonfunctional. Utilizing synthetic two-dimensional molecules that mimic the dynamics of functional and nonfunctional proteins, biases are identified and controlled in both the machine learning model and selected training data under different contexts. The refinement of a working hypothesis converges to a statistically robust multivariate perception of the data based on a context-dependent perspective. Including diverse perspectives during data exploration enhances interpretability of the multivariate characterization of similarities and differences.
    Machine Learning-Based COVID-19 Patients Triage Algorithm using Patient-Generated Health Data from Nationwide Multicenter Database. (arXiv:2109.09001v3 [cs.LG] UPDATED)
    (2 min) A model that can immediately measure the severity of an epidemic would be of great help to the medical system. In this paper, we will use machine learning to create and analyze a model that can immediately measure the severity of SARS-CoV-2 patients. Since our model uses nationwide dataset that consists of basic personal information of patients, it is of great significance for the patient to be able to check their own severity. Moreover, based on the predicted severity score, we use our model to inform the patients to visit the appropriate clinic center.
    Homological Time Series Analysis of Sensor Signals from Power Plants. (arXiv:2106.02493v4 [cs.LG] UPDATED)
    (2 min) In this paper, we use topological data analysis techniques to construct a suitable neural network classifier for the task of learning sensor signals of entire power plants according to their reference designation system. We use representations of persistence diagrams to derive necessary preprocessing steps and visualize the large amounts of data. We derive deep architectures with one-dimensional convolutional layers combined with stacked long short-term memories as residual networks suitable for processing the persistence features. We combine three separate sub-networks, obtaining as input the time series itself and a representation of the persistent homology for the zeroth and first dimension. We give a mathematical derivation for most of the used hyper-parameters. For validation, numerical experiments were performed with sensor data from four power plants of the same construction type.
    CEMENT: Incomplete Multi-View Weak-Label Learning with Long Tail Labels. (arXiv:2201.01079v1 [cs.LG])
    (2 min) A variety of modern applications exhibit multi-view multi-label learning, where each sample has multi-view features, and multiple labels are correlated via common views. In recent years, several methods have been proposed to cope with it and achieve much success, but still suffer from two key problems: 1) lack the ability to deal with the incomplete multi-view weak-label data, in which only a subset of features and labels are provided for each sample; 2) ignore the presence of noisy views and tail labels usually occurring in real-world problems. In this paper, we propose a novel method, named CEMENT, to overcome the limitations. For 1), CEMENT jointly embeds incomplete views and weak labels into distinct low-dimensional subspaces, and then correlates them via Hilbert-Schmidt Independence Criterion (HSIC). For 2), CEMEMT adaptively learns the weights of embeddings to capture noisy views, and explores an additional sparse component to model tail labels, making the low-rankness available in the multi-label setting. We develop an alternating algorithm to solve the proposed optimization problem. Experimental results on seven real-world datasets demonstrate the effectiveness of the proposed method.
    Classifying Autism from Crowdsourced Semi-Structured Speech Recordings: A Machine Learning Approach. (arXiv:2201.00927v1 [cs.SD])
    (2 min) Autism spectrum disorder (ASD) is a neurodevelopmental disorder which results in altered behavior, social development, and communication patterns. In past years, autism prevalence has tripled, with 1 in 54 children now affected. Given that traditional diagnosis is a lengthy, labor-intensive process, significant attention has been given to developing systems that automatically screen for autism. Prosody abnormalities are among the clearest signs of autism, with affected children displaying speech idiosyncrasies including echolalia, monotonous intonation, atypical pitch, and irregular linguistic stress patterns. In this work, we present a suite of machine learning approaches to detect autism in self-recorded speech audio captured from autistic and neurotypical (NT) children in home environments. We consider three methods to detect autism in child speech: first, Random Forests trained on extracted audio features (including Mel-frequency cepstral coefficients); second, convolutional neural networks (CNNs) trained on spectrograms; and third, fine-tuned wav2vec 2.0--a state-of-the-art Transformer-based ASR model. We train our classifiers on our novel dataset of cellphone-recorded child speech audio curated from Stanford's Guess What? mobile game, an app designed to crowdsource videos of autistic and neurotypical children in a natural home environment. The Random Forest classifier achieves 70% accuracy, the fine-tuned wav2vec 2.0 model achieves 77% accuracy, and the CNN achieves 79% accuracy when classifying children's audio as either ASD or NT. Our models were able to predict autism status when training on a varied selection of home audio clips with inconsistent recording quality, which may be more generalizable to real world conditions. These results demonstrate that machine learning methods offer promise in detecting autism automatically from speech without specialized equipment.
    Towards Fair Recommendation in Two-Sided Platforms. (arXiv:2201.01180v1 [cs.IR])
    (2 min) Many online platforms today (such as Amazon, Netflix, Spotify, LinkedIn, and AirBnB) can be thought of as two-sided markets with producers and customers of goods and services. Traditionally, recommendation services in these platforms have focused on maximizing customer satisfaction by tailoring the results according to the personalized preferences of individual customers. However, our investigation reinforces the fact that such customer-centric design of these services may lead to unfair distribution of exposure to the producers, which may adversely impact their well-being. On the other hand, a pure producer-centric design might become unfair to the customers. As more and more people are depending on such platforms to earn a living, it is important to ensure fairness to both producers and customers. In this work, by mapping a fair personalized recommendation problem to a constrained version of the problem of fairly allocating indivisible goods, we propose to provide fairness guarantees for both sides. Formally, our proposed {\em FairRec} algorithm guarantees Maxi-Min Share ($\alpha$-MMS) of exposure for the producers, and Envy-Free up to One Item (EF1) fairness for the customers. Extensive evaluations over multiple real-world datasets show the effectiveness of {\em FairRec} in ensuring two-sided fairness while incurring a marginal loss in overall recommendation quality. Finally, we present a modification of FairRec (named as FairRecPlus) that at the cost of additional computation time, improves the recommendation performance for the customers, while maintaining the same fairness guarantees.
    Nipping in the Bud: Detection, Diffusion and Mitigation of Hate Speech on Social Media. (arXiv:2201.00961v1 [cs.SI])
    (2 min) Since the proliferation of social media usage, hate speech has become a major crisis. Hateful content can spread quickly and create an environment of distress and hostility. Further, what can be considered hateful is contextual and varies with time. While online hate speech reduces the ability of already marginalised groups to participate in discussion freely, offline hate speech leads to hate crimes and violence against individuals and communities. The multifaceted nature of hate speech and its real-world impact have already piqued the interest of the data mining and machine learning communities. Despite our best efforts, hate speech remains an evasive issue for researchers and practitioners alike. This article presents methodological challenges that hinder building automated hate mitigation systems. These challenges inspired our work in the broader area of combating hateful content on the web. We discuss a series of our proposed solutions to limit the spread of hate speech on social media.
    CHERRY: a Computational metHod for accuratE pRediction of virus-pRokarYotic interactions using a graph encoder-decoder model. (arXiv:2201.01018v1 [q-bio.GN])
    (2 min) Prokaryotic viruses, which infect bacteria and archaea, are key players in microbial communities. Predicting the hosts of prokaryotic viruses helps decipher the dynamic relationship between microbes. Although there are experimental methods for host identification, they are either labor-intensive or require the cultivation of the host cells, creating a need for computational host prediction. Despite some promising results, computational host prediction remains a challenge because of the limited known interactions and the sheer amount of sequenced phages by high-throughput sequencing technologies. The state-of-the-art methods can only achieve 43% accuracy at the species level. This work presents CHERRY, a tool formulating host prediction as link prediction in a knowledge graph. As a virus-prokaryotic interaction prediction tool, CHERRY can be applied to predict hosts for newly discovered viruses and also the viruses infecting antibiotic-resistant bacteria. We demonstrated the utility of CHERRY for both applications and compared its performance with the state-of-the-art methods in different scenarios. To our best knowledge, CHERRY has the highest accuracy in identifying virus-prokaryote interactions. It outperforms all the existing methods at the species level with an accuracy increase of 37%. In addition, CHERRY's performance is more stable on short contigs than other tools.
    Extending CLIP for Category-to-image Retrieval in E-commerce. (arXiv:2112.11294v2 [cs.IR] UPDATED)
    (2 min) E-commerce provides rich multimodal data that is barely leveraged in practice. One aspect of this data is a category tree that is being used in search and recommendation. However, in practice, during a user's session there is often a mismatch between a textual and a visual representation of a given category. Motivated by the problem, we introduce the task of category-to-image retrieval in e-commerce and propose a model for the task, CLIP-ITA. The model leverages information from multiple modalities (textual, visual, and attribute modality) to create product representations. We explore how adding information from multiple modalities (textual, visual, and attribute modality) impacts the model's performance. In particular, we observe that CLIP-ITA significantly outperforms a comparable model that leverages only the visual modality and a comparable model that leverages the visual and attribute modality.
    Aligning Domain-specific Distribution and Classifier for Cross-domain Classification from Multiple Sources. (arXiv:2201.01003v1 [cs.LG])
    (2 min) While Unsupervised Domain Adaptation (UDA) algorithms, i.e., there are only labeled data from source domains, have been actively studied in recent years, most algorithms and theoretical results focus on Single-source Unsupervised Domain Adaptation (SUDA). However, in the practical scenario, labeled data can be typically collected from multiple diverse sources, and they might be different not only from the target domain but also from each other. Thus, domain adapters from multiple sources should not be modeled in the same way. Recent deep learning based Multi-source Unsupervised Domain Adaptation (MUDA) algorithms focus on extracting common domain-invariant representations for all domains by aligning distribution of all pairs of source and target domains in a common feature space. However, it is often very hard to extract the same domain-invariant representations for all domains in MUDA. In addition, these methods match distributions without considering domain-specific decision boundaries between classes. To solve these problems, we propose a new framework with two alignment stages for MUDA which not only respectively aligns the distributions of each pair of source and target domains in multiple specific feature spaces, but also aligns the outputs of classifiers by utilizing the domain-specific decision boundaries. Extensive experiments demonstrate that our method can achieve remarkable results on popular benchmark datasets for image classification.
    A Heterogeneous In-Memory Computing Cluster For Flexible End-to-End Inference of Real-World Deep Neural Networks. (arXiv:2201.01089v1 [cs.AR])
    (2 min) Deployment of modern TinyML tasks on small battery-constrained IoT devices requires high computational energy efficiency. Analog In-Memory Computing (IMC) using non-volatile memory (NVM) promises major efficiency improvements in deep neural network (DNN) inference and serves as on-chip memory storage for DNN weights. However, IMC's functional flexibility limitations and their impact on performance, energy, and area efficiency are not yet fully understood at the system level. To target practical end-to-end IoT applications, IMC arrays must be enclosed in heterogeneous programmable systems, introducing new system-level challenges which we aim at addressing in this work. We present a heterogeneous tightly-coupled clustered architecture integrating 8 RISC-V cores, an in-memory computing accelerator (IMA), and digital accelerators. We benchmark the system on a highly heterogeneous workload such as the Bottleneck layer from a MobileNetV2, showing 11.5x performance and 9.5x energy efficiency improvements, compared to highly optimized parallel execution on the cores. Furthermore, we explore the requirements for end-to-end inference of a full mobile-grade DNN (MobileNetV2) in terms of IMC array resources, by scaling up our heterogeneous architecture to a multi-array accelerator. Our results show that our solution, on the end-to-end inference of the MobileNetV2, is one order of magnitude better in terms of execution latency than existing programmable architectures and two orders of magnitude better than state-of-the-art heterogeneous solutions integrating in-memory computing analog cores.
    Graph Neural Networks With Lifting-based Adaptive Graph Wavelets. (arXiv:2108.01660v3 [cs.LG] UPDATED)
    (2 min) Spectral-based graph neural networks (SGNNs) have been attracting increasing attention in graph representation learning. However, existing SGNNs are limited in implementing graph filters with rigid transforms (e.g., graph Fourier or predefined graph wavelet transforms) and cannot adapt to signals residing on graphs and tasks at hand. In this paper, we propose a novel class of graph neural networks that realizes graph filters with adaptive graph wavelets. Specifically, the adaptive graph wavelets are learned with neural network-parameterized lifting structures, where structure-aware attention-based lifting operations (i.e., prediction and update operations) are developed to jointly consider graph structures and node features. We propose to lift based on diffusion wavelets to alleviate the structural information loss induced by partitioning non-bipartite graphs. By design, the locality and sparsity of the resulting wavelet transform as well as the scalability of the lifting structure are guaranteed. We further derive a soft-thresholding filtering operation by learning sparse graph representations in terms of the learned wavelets, yielding a localized, efficient, and scalable wavelet-based graph filters. To ensure that the learned graph representations are invariant to node permutations, a layer is employed at the input of the networks to reorder the nodes according to their local topology information. We evaluate the proposed networks in both node-level and graph-level representation learning tasks on benchmark citation and bioinformatics graph datasets. Extensive experiments demonstrate the superiority of the proposed networks over existing SGNNs in terms of accuracy, efficiency, and scalability.
    Submix: Practical Private Prediction for Large-Scale Language Models. (arXiv:2201.00971v1 [cs.LG])
    (2 min) Recent data-extraction attacks have exposed that language models can memorize some training samples verbatim. This is a vulnerability that can compromise the privacy of the model's training data. In this work, we introduce SubMix: a practical protocol for private next-token prediction designed to prevent privacy violations by language models that were fine-tuned on a private corpus after pre-training on a public corpus. We show that SubMix limits the leakage of information that is unique to any individual user in the private corpus via a relaxation of group differentially private prediction. Importantly, SubMix admits a tight, data-dependent privacy accounting mechanism, which allows it to thwart existing data-extraction attacks while maintaining the utility of the language model. SubMix is the first protocol that maintains privacy even when publicly releasing tens of thousands of next-token predictions made by large transformer-based models such as GPT-2.
    Rxn Hypergraph: a Hypergraph Attention Model for Chemical Reaction Representation. (arXiv:2201.01196v1 [cs.LG])
    (0 min) It is fundamental for science and technology to be able to predict chemical reactions and their properties. To achieve such skills, it is important to develop good representations of chemical reactions, or good deep learning architectures that can learn such representations automatically from the data. There is currently no universal and widely adopted method for robustly representing chemical reactions. Most existing methods suffer from one or more drawbacks, such as: (1) lacking universality; (2) lacking robustness; (3) lacking interpretability; or (4) requiring excessive manual pre-processing. Here we exploit graph-based representations of molecular structures to develop and test a hypergraph attention neural network approach to solve at once the reaction representation and property-prediction problems, alleviating the aforementioned drawbacks. We evaluate this hypergraph representation in three experiments using three independent data sets of chemical reactions. In all experiments, the hypergraph-based approach matches or outperforms other representations and their corresponding models of chemical reactions while yielding interpretable multi-level representations.
    Finding General Equilibria in Many-Agent Economic Simulations Using Deep Reinforcement Learning. (arXiv:2201.01163v1 [cs.GT])
    (0 min) Real economies can be seen as a sequential imperfect-information game with many heterogeneous, interacting strategic agents of various agent types, such as consumers, firms, and governments. Dynamic general equilibrium models are common economic tools to model the economic activity, interactions, and outcomes in such systems. However, existing analytical and computational methods struggle to find explicit equilibria when all agents are strategic and interact, while joint learning is unstable and challenging. Amongst others, a key reason is that the actions of one economic agent may change the reward function of another agent, e.g., a consumer's expendable income changes when firms change prices or governments change taxes. We show that multi-agent deep reinforcement learning (RL) can discover stable solutions that are epsilon-Nash equilibria for a meta-game over agent types, in economic simulations with many agents, through the use of structured learning curricula and efficient GPU-only simulation and training. Conceptually, our approach is more flexible and does not need unrealistic assumptions, e.g., market clearing, that are commonly used for analytical tractability. Our GPU implementation enables training and analyzing economies with a large number of agents within reasonable time frames, e.g., training completes within a day. We demonstrate our approach in real-business-cycle models, a representative family of DGE models, with 100 worker-consumers, 10 firms, and a government who taxes and redistributes. We validate the learned meta-game epsilon-Nash equilibria through approximate best-response analyses, show that RL policies align with economic intuitions, and that our approach is constructive, e.g., by explicitly learning a spectrum of meta-game epsilon-Nash equilibria in open RBC models.
    Positional Encoding Augmented GAN for the Assessment of Wind Flow for Pedestrian Comfort in Urban Areas. (arXiv:2112.08447v2 [cs.CV] UPDATED)
    (0 min) Approximating wind flows using computational fluid dynamics (CFD) methods can be time-consuming. Creating a tool for interactively designing prototypes while observing the wind flow change requires simpler models to simulate faster. Instead of running numerical approximations resulting in detailed calculations, data-driven methods and deep learning might be able to give similar results in a fraction of the time. This work rephrases the problem from computing 3D flow fields using CFD to a 2D image-to-image translation-based problem on the building footprints to predict the flow field at pedestrian height level. We investigate the use of generative adversarial networks (GAN), such as Pix2Pix [1] and CycleGAN [2] representing state-of-the-art for image-to-image translation task in various domains as well as U-Net autoencoder [3]. The models can learn the underlying distribution of a dataset in a data-driven manner, which we argue can help the model learn the underlying Reynolds-averaged Navier-Stokes (RANS) equations from CFD. We experiment on novel simulated datasets on various three-dimensional bluff-shaped buildings with and without height information. Moreover, we present an extensive qualitative and quantitative evaluation of the generated images for a selection of models and compare their performance with the simulations delivered by CFD. We then show that adding positional data to the input can produce more accurate results by proposing a general framework for injecting such information on the different architectures. Furthermore, we show that the models performances improve by applying attention mechanisms and spectral normalization to facilitate stable training.
    Quantifying Uncertainty in Deep Learning Approaches to Radio Galaxy Classification. (arXiv:2201.01203v1 [astro-ph.CO])
    (0 min) In this work we use variational inference to quantify the degree of uncertainty in deep learning model predictions of radio galaxy classification. We show that the level of model posterior variance for individual test samples is correlated with human uncertainty when labelling radio galaxies. We explore the model performance and uncertainty calibration for a variety of different weight priors and suggest that a sparse prior produces more well-calibrated uncertainty estimates. Using the posterior distributions for individual weights, we show that we can prune 30% of the fully-connected layer weights without significant loss of performance by removing the weights with the lowest signal-to-noise ratio (SNR). We demonstrate that a larger degree of pruning can be achieved using a Fisher information based ranking, but we note that both pruning methods affect the uncertainty calibration for Fanaroff-Riley type I and type II radio galaxies differently. Finally we show that, like other work in this field, we experience a cold posterior effect, whereby the posterior must be down-weighted to achieve good predictive performance. We examine whether adapting the cost function to accommodate model misspecification can compensate for this effect, but find that it does not make a significant difference. We also examine the effect of principled data augmentation and find that this improves upon the baseline but also does not compensate for the observed effect. We interpret this as the cold posterior effect being due to the overly effective curation of our training sample leading to likelihood misspecification, and raise this as a potential issue for Bayesian deep learning approaches to radio galaxy classification in future.
    Delving into Sample Loss Curve to Embrace Noisy and Imbalanced Data. (arXiv:2201.00849v1 [cs.LG])
    (2 min) Corrupted labels and class imbalance are commonly encountered in practically collected training data, which easily leads to over-fitting of deep neural networks (DNNs). Existing approaches alleviate these issues by adopting a sample re-weighting strategy, which is to re-weight sample by designing weighting function. However, it is only applicable for training data containing only either one type of data biases. In practice, however, biased samples with corrupted labels and of tailed classes commonly co-exist in training data. How to handle them simultaneously is a key but under-explored problem. In this paper, we find that these two types of biased samples, though have similar transient loss, have distinguishable trend and characteristics in loss curves, which could provide valuable priors for sample weight assignment. Motivated by this, we delve into the loss curves and propose a novel probe-and-allocate training strategy: In the probing stage, we train the network on the whole biased training data without intervention, and record the loss curve of each sample as an additional attribute; In the allocating stage, we feed the resulting attribute to a newly designed curve-perception network, named CurveNet, to learn to identify the bias type of each sample and assign proper weights through meta-learning adaptively. The training speed of meta learning also blocks its application. To solve it, we propose a method named skip layer meta optimization (SLMO) to accelerate training speed by skipping the bottom layers. Extensive synthetic and real experiments well validate the proposed method, which achieves state-of-the-art performance on multiple challenging benchmarks.
    Interactive Attention AI to translate low light photos to captions for night scene understanding in women safety. (arXiv:2201.00969v1 [cs.CV])
    (0 min) There is amazing progress in Deep Learning based models for Image captioning and Low Light image enhancement. For the first time in literature, this paper develops a Deep Learning model that translates night scenes to sentences, opening new possibilities for AI applications in the safety of visually impaired women. Inspired by Image Captioning and Visual Question Answering, a novel Interactive Image Captioning is developed. A user can make the AI focus on any chosen person of interest by influencing the attention scoring. Attention context vectors are computed from CNN feature vectors and user-provided start word. The Encoder-Attention-Decoder neural network learns to produce captions from low brightness images. This paper demonstrates how women safety can be enabled by researching a novel AI capability in the Interactive Vision-Language model for perception of the environment in the night.
    Multimodal Co-learning: Challenges, Applications with Datasets, Recent Advances and Future Directions. (arXiv:2107.13782v3 [cs.LG] UPDATED)
    (0 min) Multimodal deep learning systems which employ multiple modalities like text, image, audio, video, etc., are showing better performance in comparison with individual modalities (i.e., unimodal) systems. Multimodal machine learning involves multiple aspects: representation, translation, alignment, fusion, and co-learning. In the current state of multimodal machine learning, the assumptions are that all modalities are present, aligned, and noiseless during training and testing time. However, in real-world tasks, typically, it is observed that one or more modalities are missing, noisy, lacking annotated data, have unreliable labels, and are scarce in training or testing and or both. This challenge is addressed by a learning paradigm called multimodal co-learning. The modeling of a (resource-poor) modality is aided by exploiting knowledge from another (resource-rich) modality using transfer of knowledge between modalities, including their representations and predictive models. Co-learning being an emerging area, there are no dedicated reviews explicitly focusing on all challenges addressed by co-learning. To that end, in this work, we provide a comprehensive survey on the emerging area of multimodal co-learning that has not been explored in its entirety yet. We review implementations that overcome one or more co-learning challenges without explicitly considering them as co-learning challenges. We present the comprehensive taxonomy of multimodal co-learning based on the challenges addressed by co-learning and associated implementations. The various techniques employed to include the latest ones are reviewed along with some of the applications and datasets. Our final goal is to discuss challenges and perspectives along with the important ideas and directions for future work that we hope to be beneficial for the entire research community focusing on this exciting domain.
    Super-resolution in Molecular Dynamics Trajectory Reconstruction with Bi-Directional Neural Networks. (arXiv:2201.01195v1 [physics.comp-ph])
    (0 min) Molecular dynamics simulations are a cornerstone in science, allowing to investigate from the system's thermodynamics to analyse intricate molecular interactions. In general, to create extended molecular trajectories can be a computationally expensive process, for example, when running $ab-initio$ simulations. Hence, repeating such calculations to either obtain more accurate thermodynamics or to get a higher resolution in the dynamics generated by a fine-grained quantum interaction can be time- and computationally-consuming. In this work, we explore different machine learning (ML) methodologies to increase the resolution of molecular dynamics trajectories on-demand within a post-processing step. As a proof of concept, we analyse the performance of bi-directional neural networks such as neural ODEs, Hamiltonian networks, recurrent neural networks and LSTMs, as well as the uni-directional variants as a reference, for molecular dynamics simulations (here: the MD17 dataset). We have found that Bi-LSTMs are the best performing models; by utilizing the local time-symmetry of thermostated trajectories they can even learn long-range correlations and display high robustness to noisy dynamics across molecular complexity. Our models can reach accuracies of up to 10$^{-4}$ angstroms in trajectory interpolation, while faithfully reconstructing several full cycles of unseen intricate high-frequency molecular vibrations, rendering the comparison between the learned and reference trajectories indistinguishable. The results reported in this work can serve (1) as a baseline for larger systems, as well as (2) for the construction of better MD integrators.
    Runway Extraction and Improved Mapping from Space Imagery. (arXiv:2201.00848v1 [cs.CV])
    (0 min) Change detection methods applied to monitoring key infrastructure like airport runways represent an important capability for disaster relief and urban planning. The present work identifies two generative adversarial networks (GAN) architectures that translate reversibly between plausible runway maps and satellite imagery. We illustrate the training capability using paired images (satellite-map) from the same point of view and using the Pix2Pix architecture or conditional GANs. In the absence of available pairs, we likewise show that CycleGAN architectures with four network heads (discriminator-generator pairs) can also provide effective style transfer from raw image pixels to outline or feature maps. To emphasize the runway and tarmac boundaries, we experimentally show that the traditional grey-tan map palette is not a required training input but can be augmented by higher contrast mapping palettes (red-black) for sharper runway boundaries. We preview a potentially novel use case (called "sketch2satellite") where a human roughly draws the current runway boundaries and automates the machine output of plausible satellite images. Finally, we identify examples of faulty runway maps where the published satellite and mapped runways disagree but an automated update renders the correct map using GANs.
    Multi-Representation Adaptation Network for Cross-domain Image Classification. (arXiv:2201.01002v1 [cs.CV])
    (0 min) In image classification, it is often expensive and time-consuming to acquire sufficient labels. To solve this problem, domain adaptation often provides an attractive option given a large amount of labeled data from a similar nature but different domain. Existing approaches mainly align the distributions of representations extracted by a single structure and the representations may only contain partial information, e.g., only contain part of the saturation, brightness, and hue information. Along this line, we propose Multi-Representation Adaptation which can dramatically improve the classification accuracy for cross-domain image classification and specially aims to align the distributions of multiple representations extracted by a hybrid structure named Inception Adaptation Module (IAM). Based on this, we present Multi-Representation Adaptation Network (MRAN) to accomplish the cross-domain image classification task via multi-representation alignment which can capture the information from different aspects. In addition, we extend Maximum Mean Discrepancy (MMD) to compute the adaptation loss. Our approach can be easily implemented by extending most feed-forward models with IAM, and the network can be trained efficiently via back-propagation. Experiments conducted on three benchmark image datasets demonstrate the effectiveness of MRAN. The code has been available at https://github.com/easezyc/deep-transfer-learning.
    Self-supervised representation learning from 12-lead ECG data. (arXiv:2103.12676v2 [eess.SP] UPDATED)
    (0 min) Clinical 12-lead electrocardiography (ECG) is one of the most widely encountered kinds of biosignals. Despite the increased availability of public ECG datasets, label scarcity remains a central challenge in the field. Self-supervised learning represents a promising way to alleviate this issue. In this work, we put forward the first comprehensive assessment of self-supervised representation learning from clinical 12-lead ECG data. To this end, we adapt state-of-the-art self-supervised methods based on instance discrimination and latent forecasting to the ECG domain. In a first step, we learn contrastive representations and evaluate their quality based on linear evaluation performance on a recently established, comprehensive, clinical ECG classification task. In a second step, we analyze the impact of self-supervised pretraining on finetuned ECG classifiers as compared to purely supervised performance. For the best-performing method, an adaptation of contrastive predictive coding, we find a linear evaluation performance only 0.5% below supervised performance. For the finetuned models, we find improvements in downstream performance of roughly 1% compared to supervised performance, label efficiency, as well as robustness against physiological noise. This work clearly establishes the feasibility of extracting discriminative representations from ECG data via self-supervised learning and the numerous advantages when finetuning such representations on downstream tasks as compared to purely supervised training. As first comprehensive assessment of its kind in the ECG domain carried out exclusively on publicly available datasets, we hope to establish a first step towards reproducible progress in the rapidly evolving field of representation learning for biosignals.
    Transfer Learning for Retinal Vascular Disease Detection: A Pilot Study with Diabetic Retinopathy and Retinopathy of Prematurity. (arXiv:2201.01250v1 [cs.LG])
    (0 min) Retinal vascular diseases affect the well-being of human body and sometimes provide vital signs of otherwise undetected bodily damage. Recently, deep learning techniques have been successfully applied for detection of diabetic retinopathy (DR). The main obstacle of applying deep learning techniques to detect most other retinal vascular diseases is the limited amount of data available. In this paper, we propose a transfer learning technique that aims to utilize the feature similarities for detecting retinal vascular diseases. We choose the well-studied DR detection as a source task and identify the early detection of retinopathy of prematurity (ROP) as the target task. Our experimental results demonstrate that our DR-pretrained approach dominates in all metrics the conventional ImageNet-pretrained transfer learning approach, currently adopted in medical image analysis. Moreover, our approach is more robust with respect to the stochasticity in the training process and with respect to reduced training samples. This study suggests the potential of our proposed transfer learning approach for a broad range of retinal vascular diseases or pathologies, where data is limited.
    Marginal likelihood computation for model selection and hypothesis testing: an extensive review. (arXiv:2005.08334v4 [stat.CO] UPDATED)
    (0 min) This is an up-to-date introduction to, and overview of, marginal likelihood computation for model selection and hypothesis testing. Computing normalizing constants of probability models (or ratio of constants) is a fundamental issue in many applications in statistics, applied mathematics, signal processing and machine learning. This article provides a comprehensive study of the state-of-the-art of the topic. We highlight limitations, benefits, connections and differences among the different techniques. Problems and possible solutions with the use of improper priors are also described. Some of the most relevant methodologies are compared through theoretical comparisons and numerical experiments.
    A Deeper Understanding of State-Based Critics in Multi-Agent Reinforcement Learning. (arXiv:2201.01221v1 [cs.LG])
    (0 min) Centralized Training for Decentralized Execution, where training is done in a centralized offline fashion, has become a popular solution paradigm in Multi-Agent Reinforcement Learning. Many such methods take the form of actor-critic with state-based critics, since centralized training allows access to the true system state, which can be useful during training despite not being available at execution time. State-based critics have become a common empirical choice, albeit one which has had limited theoretical justification or analysis. In this paper, we show that state-based critics can introduce bias in the policy gradient estimates, potentially undermining the asymptotic guarantees of the algorithm. We also show that, even if the state-based critics do not introduce any bias, they can still result in a larger gradient variance, contrary to the common intuition. Finally, we show the effects of the theories in practice by comparing different forms of centralized critics on a wide range of common benchmarks, and detail how various environmental properties are related to the effectiveness of different types of critics.
    Kendall transformation: a robust representation of continuous data for information theory. (arXiv:2006.15991v2 [cs.LG] UPDATED)
    (0 min) Kendall transformation is a conversion of an ordered feature into a vector of pairwise order relations between individual values. This way, it preserves ranking of observations and represents it in a categorical form. Such transformation allows for generalisation of methods requiring strictly categorical input, especially in the limit of small number of observations, when discretisation becomes problematic. In particular, many approaches of information theory can be directly applied to Kendall-transformed continuous data without relying on differential entropy or any additional parameters. Moreover, by filtering information to this contained in ranking, Kendall transformation leads to a better robustness at a reasonable cost of dropping sophisticated interactions which are anyhow unlikely to be correctly estimated. In bivariate analysis, Kendall transformation can be related to popular non-parametric methods, showing the soundness of the approach. The paper also demonstrates its efficiency in multivariate problems, as well as provides an example analysis of a real-world data.
    Modelling Cournot Games as Multi-agent Multi-armed Bandits. (arXiv:2201.01182v1 [cs.GT])
    (0 min) We investigate the use of a multi-agent multi-armed bandit (MA-MAB) setting for modeling repeated Cournot oligopoly games, where the firms acting as agents choose from the set of arms representing production quantity (a discrete value). Agents interact with separate and independent bandit problems. In this formulation, each agent makes sequential choices among arms to maximize its own reward. Agents do not have any information about the environment; they can only see their own rewards after taking an action. However, the market demand is a stationary function of total industry output, and random entry or exit from the market is not allowed. Given these assumptions, we found that an $\epsilon$-greedy approach offers a more viable learning mechanism than other traditional MAB approaches, as it does not require any additional knowledge of the system to operate. We also propose two novel approaches that take advantage of the ordered action space: $\epsilon$-greedy+HL and $\epsilon$-greedy+EL. These new approaches help firms to focus on more profitable actions by eliminating less profitable choices and hence are designed to optimize the exploration. We use computer simulations to study the emergence of various equilibria in the outcomes and do the empirical analysis of joint cumulative regrets.
    Efficient Quantum Feature Extraction for CNN-based Learning. (arXiv:2201.01246v1 [quant-ph])
    (0 min) Recent work has begun to explore the potential of parametrized quantum circuits (PQCs) as general function approximators. In this work, we propose a quantum-classical deep network structure to enhance classical CNN model discriminability. The convolutional layer uses linear filters to scan the input data. Moreover, we build PQC, which is a more potent function approximator, with more complex structures to capture the features within the receptive field. The feature maps are obtained by sliding the PQCs over the input in a similar way as CNN. We also give a training algorithm for the proposed model. The hybrid models used in our design are validated by numerical simulation. We demonstrate the reasonable classification performances on MNIST and we compare the performances with models in different settings. The results disclose that the model with ansatz in high expressibility achieves lower cost and higher accuracy.
    Learning Operators with Coupled Attention. (arXiv:2201.01032v1 [cs.LG])
    (0 min) Supervised operator learning is an emerging machine learning paradigm with applications to modeling the evolution of spatio-temporal dynamical systems and approximating general black-box relationships between functional data. We propose a novel operator learning method, LOCA (Learning Operators with Coupled Attention), motivated from the recent success of the attention mechanism. In our architecture, the input functions are mapped to a finite set of features which are then averaged with attention weights that depend on the output query locations. By coupling these attention weights together with an integral transform, LOCA is able to explicitly learn correlations in the target output functions, enabling us to approximate nonlinear operators even when the number of output function in the training set measurements is very small. Our formulation is accompanied by rigorous approximation theoretic guarantees on the universal expressiveness of the proposed model. Empirically, we evaluate the performance of LOCA on several operator learning scenarios involving systems governed by ordinary and partial differential equations, as well as a black-box climate prediction problem. Through these scenarios we demonstrate state of the art accuracy, robustness with respect to noisy input data, and a consistently small spread of errors over testing data sets, even for out-of-distribution prediction tasks.
    Self-directed Machine Learning. (arXiv:2201.01289v1 [cs.LG])
    (0 min) Conventional machine learning (ML) relies heavily on manual design from machine learning experts to decide learning tasks, data, models, optimization algorithms, and evaluation metrics, which is labor-intensive, time-consuming, and cannot learn autonomously like humans. In education science, self-directed learning, where human learners select learning tasks and materials on their own without requiring hands-on guidance, has been shown to be more effective than passive teacher-guided learning. Inspired by the concept of self-directed human learning, we introduce the principal concept of Self-directed Machine Learning (SDML) and propose a framework for SDML. Specifically, we design SDML as a self-directed learning process guided by self-awareness, including internal awareness and external awareness. Our proposed SDML process benefits from self task selection, self data selection, self model selection, self optimization strategy selection and self evaluation metric selection through self-awareness without human guidance. Meanwhile, the learning performance of the SDML process serves as feedback to further improve self-awareness. We propose a mathematical formulation for SDML based on multi-level optimization. Furthermore, we present case studies together with potential applications of SDML, followed by discussing future research directions. We expect that SDML could enable machines to conduct human-like self-directed learning and provide a new perspective towards artificial general intelligence.
    Deep learning architectures for nonlinear operator functions and nonlinear inverse problems. (arXiv:1912.11090v3 [math.OC] UPDATED)
    (0 min) We develop a theoretical analysis for special neural network architectures, termed operator recurrent neural networks, for approximating nonlinear functions whose inputs are linear operators. Such functions commonly arise in solution algorithms for inverse boundary value problems. Traditional neural networks treat input data as vectors, and thus they do not effectively capture the multiplicative structure associated with the linear operators that correspond to the data in such inverse problems. We therefore introduce a new family that resembles a standard neural network architecture, but where the input data acts multiplicatively on vectors. Motivated by compact operators appearing in boundary control and the analysis of inverse boundary value problems for the wave equation, we promote structure and sparsity in selected weight matrices in the network. After describing this architecture, we study its representation properties as well as its approximation properties. We furthermore show that an explicit regularization can be introduced that can be derived from the mathematical analysis of the mentioned inverse problems, and which leads to certain guarantees on the generalization properties. We observe that the sparsity of the weight matrices improves the generalization estimates. Lastly, we discuss how operator recurrent networks can be viewed as a deep learning analogue to deterministic algorithms such as boundary control for reconstructing the unknown wavespeed in the acoustic wave equation from boundary measurements.
    McXai: Local model-agnostic explanation as two games. (arXiv:2201.01044v1 [cs.LG])
    (0 min) To this day, a variety of approaches for providing local interpretability of black-box machine learning models have been introduced. Unfortunately, all of these methods suffer from one or more of the following deficiencies: They are either difficult to understand themselves, they work on a per-feature basis and ignore the dependencies between features and/or they only focus on those features asserting the decision made by the model. To address these points, this work introduces a reinforcement learning-based approach called Monte Carlo tree search for eXplainable Artificial Intelligent (McXai) to explain the decisions of any black-box classification model (classifier). Our method leverages Monte Carlo tree search and models the process of generating explanations as two games. In one game, the reward is maximized by finding feature sets that support the decision of the classifier, while in the second game, finding feature sets leading to alternative decisions maximizes the reward. The result is a human friendly representation as a tree structure, in which each node represents a set of features to be studied with smaller explanations at the top of the tree. Our experiments show, that the features found by our method are more informative with respect to classifications than those found by classical approaches like LIME and SHAP. Furthermore, by also identifying misleading features, our approach is able to guide towards improved robustness of the black-box model in many situations.
    Swin UNETR: Swin Transformers for Semantic Segmentation of Brain Tumors in MRI Images. (arXiv:2201.01266v1 [eess.IV])
    (0 min) Semantic segmentation of brain tumors is a fundamental medical image analysis task involving multiple MRI imaging modalities that can assist clinicians in diagnosing the patient and successively studying the progression of the malignant entity. In recent years, Fully Convolutional Neural Networks (FCNNs) approaches have become the de facto standard for 3D medical image segmentation. The popular "U-shaped" network architecture has achieved state-of-the-art performance benchmarks on different 2D and 3D semantic segmentation tasks and across various imaging modalities. However, due to the limited kernel size of convolution layers in FCNNs, their performance of modeling long-range information is sub-optimal, and this can lead to deficiencies in the segmentation of tumors with variable sizes. On the other hand, transformer models have demonstrated excellent capabilities in capturing such long-range information in multiple domains, including natural language processing and computer vision. Inspired by the success of vision transformers and their variants, we propose a novel segmentation model termed Swin UNEt TRansformers (Swin UNETR). Specifically, the task of 3D brain tumor semantic segmentation is reformulated as a sequence to sequence prediction problem wherein multi-modal input data is projected into a 1D sequence of embedding and used as an input to a hierarchical Swin transformer as the encoder. The swin transformer encoder extracts features at five different resolutions by utilizing shifted windows for computing self-attention and is connected to an FCNN-based decoder at each resolution via skip connections. We have participated in BraTS 2021 segmentation challenge, and our proposed model ranks among the top-performing approaches in the validation phase. Code: https://monai.io/research/swin-unetr
    Parity-based Cumulative Fairness-aware Boosting. (arXiv:2201.01148v1 [cs.LG])
    (0 min) Data-driven AI systems can lead to discrimination on the basis of protected attributes like gender or race. One reason for this behavior is the encoded societal biases in the training data (e.g., females are underrepresented), which is aggravated in the presence of unbalanced class distributions (e.g., "granted" is the minority class). State-of-the-art fairness-aware machine learning approaches focus on preserving the \emph{overall} classification accuracy while improving fairness. In the presence of class-imbalance, such methods may further aggravate the problem of discrimination by denying an already underrepresented group (e.g., \textit{females}) the fundamental rights of equal social privileges (e.g., equal credit opportunity). To this end, we propose AdaFair, a fairness-aware boosting ensemble that changes the data distribution at each round, taking into account not only the class errors but also the fairness-related performance of the model defined cumulatively based on the partial ensemble. Except for the in-training boosting of the group discriminated over each round, AdaFair directly tackles imbalance during the post-training phase by optimizing the number of ensemble learners for balanced error performance (BER). AdaFair can facilitate different parity-based fairness notions and mitigate effectively discriminatory outcomes. Our experiments show that our approach can achieve parity in terms of statistical parity, equal opportunity, and disparate mistreatment while maintaining good predictive performance for all classes.
    Neural Mean Discrepancy for Efficient Out-of-Distribution Detection. (arXiv:2104.11408v3 [cs.LG] UPDATED)
    (0 min) Various approaches have been proposed for out-of-distribution (OOD) detection by augmenting models, input examples, training sets, and optimization objectives. Deviating from existing work, we have a simple hypothesis that standard off-the-shelf models may already contain sufficient information about the training set distribution which can be leveraged for reliable OOD detection. Our empirical study on validating this hypothesis, which measures the model activation's mean for OOD and in-distribution (ID) mini-batches, surprisingly finds that activation means of OOD mini-batches consistently deviate more from those of the training data. In addition, training data's activation means can be computed offline efficiently or retrieved from batch normalization layers as a `free lunch'. Based upon this observation, we propose a novel metric called Neural Mean Discrepancy (NMD), which compares neural means of the input examples and training data. Leveraging the simplicity of NMD, we propose an efficient OOD detector that computes neural means by a standard forward pass followed by a lightweight classifier. Extensive experiments show that NMD outperforms state-of-the-art OOD approaches across multiple datasets and model architectures in terms of both detection accuracy and computational cost.
    Profile Guided Optimization without Profiles: A Machine Learning Approach. (arXiv:2112.14679v2 [cs.PL] UPDATED)
    (0 min) Profile guided optimization is an effective technique for improving the optimization ability of compilers based on dynamic behavior, but collecting profile data is expensive, cumbersome, and requires regular updating to remain fresh. We present a novel statistical approach to inferring branch probabilities that improves the performance of programs that are compiled without profile guided optimizations. We perform offline training using information that is collected from a large corpus of binaries that have branch probabilities information. The learned model is used by the compiler to predict the branch probabilities of regular uninstrumented programs, which the compiler can then use to inform optimization decisions. We integrate our technique directly in LLVM, supplementing the existing human-engineered compiler heuristics. We evaluate our technique on a suite of benchmarks, demonstrating some gains over compiling without profile information. In deployment, our technique requires no profiling runs and has negligible effect on compilation time.
    Modeling Users' Behavior Sequences with Hierarchical Explainable Network for Cross-domain Fraud Detection. (arXiv:2201.01004v1 [cs.LG])
    (0 min) With the explosive growth of the e-commerce industry, detecting online transaction fraud in real-world applications has become increasingly important to the development of e-commerce platforms. The sequential behavior history of users provides useful information in differentiating fraudulent payments from regular ones. Recently, some approaches have been proposed to solve this sequence-based fraud detection problem. However, these methods usually suffer from two problems: the prediction results are difficult to explain and the exploitation of the internal information of behaviors is insufficient. To tackle the above two problems, we propose a Hierarchical Explainable Network (HEN) to model users' behavior sequences, which could not only improve the performance of fraud detection but also make the inference process interpretable. Meanwhile, as e-commerce business expands to new domains, e.g., new countries or new markets, one major problem for modeling user behavior in fraud detection systems is the limitation of data collection, e.g., very few data/labels available. Thus, in this paper, we further propose a transfer framework to tackle the cross-domain fraud detection problem, which aims to transfer knowledge from existing domains (source domains) with enough and mature data to improve the performance in the new domain (target domain). Our proposed method is a general transfer framework that could not only be applied upon HEN but also various existing models in the Embedding & MLP paradigm. Based on 90 transfer task experiments, we also demonstrate that our transfer framework could not only contribute to the cross-domain fraud detection task with HEN, but also be universal and expandable for various existing models.
    Multivariate Time Series Regression with Graph Neural Networks. (arXiv:2201.00818v1 [cs.LG])
    (0 min) Machine learning, with its advances in Deep Learning has shown great potential in analysing time series in the past. However, in many scenarios, additional information is available that can potentially improve predictions, by incorporating it into the learning methods. This is crucial for data that arises from e.g., sensor networks that contain information about sensor locations. Then, such spatial information can be exploited by modeling it via graph structures, along with the sequential (time) information. Recent advances in adapting Deep Learning to graphs have shown promising potential in various graph-related tasks. However, these methods have not been adapted for time series related tasks to a great extent. Specifically, most attempts have essentially consolidated around Spatial-Temporal Graph Neural Networks for time series forecasting with small sequence lengths. Generally, these architectures are not suited for regression or classification tasks that contain large sequences of data. Therefore, in this work, we propose an architecture capable of processing these long sequences in a multivariate time series regression task, using the benefits of Graph Neural Networks to improve predictions. Our model is tested on two seismic datasets that contain earthquake waveforms, where the goal is to predict intensity measurements of ground shaking at a set of stations. Our findings demonstrate promising results of our approach, which are discussed in depth with an additional ablation study.
    Value Functions Factorization with Latent State Information Sharing in Decentralized Multi-Agent Policy Gradients. (arXiv:2201.01247v1 [cs.MA])
    (0 min) Value function factorization via centralized training and decentralized execution is promising for solving cooperative multi-agent reinforcement tasks. One of the approaches in this area, QMIX, has become state-of-the-art and achieved the best performance on the StarCraft II micromanagement benchmark. However, the monotonic-mixing of per agent estimates in QMIX is known to restrict the joint action Q-values it can represent, as well as the insufficient global state information for single agent value function estimation, often resulting in suboptimality. To this end, we present LSF-SAC, a novel framework that features a variational inference-based information-sharing mechanism as extra state information to assist individual agents in the value function factorization. We demonstrate that such latent individual state information sharing can significantly expand the power of value function factorization, while fully decentralized execution can still be maintained in LSF-SAC through a soft-actor-critic design. We evaluate LSF-SAC on the StarCraft II micromanagement challenge and demonstrate that it outperforms several state-of-the-art methods in challenging collaborative tasks. We further set extensive ablation studies for locating the key factors accounting for its performance improvements. We believe that this new insight can lead to new local value estimation methods and variational deep learning algorithms. A demo video and code of implementation can be found at https://sites.google.com/view/sacmm.
    Automated Graph Machine Learning: Approaches, Libraries and Directions. (arXiv:2201.01288v1 [cs.LG])
    (0 min) Graph machine learning has been extensively studied in both academic and industry. However, as the literature on graph learning booms with a vast number of emerging methods and techniques, it becomes increasingly difficult to manually design the optimal machine learning algorithm for different graph-related tasks. To tackle the challenge, automated graph machine learning, which aims at discovering the best hyper-parameter and neural architecture configuration for different graph tasks/data without manual design, is gaining an increasing number of attentions from the research community. In this paper, we extensively discuss automated graph machine approaches, covering hyper-parameter optimization (HPO) and neural architecture search (NAS) for graph machine learning. We briefly overview existing libraries designed for either graph machine learning or automated machine learning respectively, and further in depth introduce AutoGL, our dedicated and the world's first open-source library for automated graph machine learning. Last but not least, we share our insights on future research directions for automated graph machine learning. This paper is the first systematic and comprehensive discussion of approaches, libraries as well as directions for automated graph machine learning.
    Efficient Multi-Organ Segmentation Using SpatialConfiguration-Net with Low GPU Memory Requirements. (arXiv:2111.13630v2 [eess.IV] UPDATED)
    (0 min) Even though many semantic segmentation methods exist that are able to perform well on many medical datasets, often, they are not designed for direct use in clinical practice. The two main concerns are generalization to unseen data with a different visual appearance, e.g., images acquired using a different scanner, and efficiency in terms of computation time and required Graphics Processing Unit (GPU) memory. In this work, we employ a multi-organ segmentation model based on the SpatialConfiguration-Net (SCN), which integrates prior knowledge of the spatial configuration among the labelled organs to resolve spurious responses in the network outputs. Furthermore, we modified the architecture of the segmentation model to reduce its memory footprint as much as possible without drastically impacting the quality of the predictions. Lastly, we implemented a minimal inference script for which we optimized both, execution time and required GPU memory.
    Resilience Aspects in Distributed Wireless Electroencephalographic Sampling. (arXiv:2201.01272v1 [eess.SP])
    (0 min) Resilience aspects of remote electroencephalography sampling are considered. The possibility to use motion sensors data and measurement of industrial power network interference for detection of failed sampling channels is demonstrated. No significant correlation between signals of failed channels and motion sensors data is shown. Level of 50 Hz spectral component from failed channels significantly differs from level of 50 Hz component of normally operating channel. Conclusions about application of these results for increasing resilience of electroencephalography sampling is made.
    Compensation Learning. (arXiv:2107.11921v4 [cs.LG] UPDATED)
    (0 min) Weighting strategy prevails in machine learning. For example, a common approach in robust machine learning is to exert lower weights on samples which are likely to be noisy or quite hard. This study reveals another undiscovered strategy, namely, compensating. Various incarnations of compensating have been utilized but it has not been explicitly revealed. Learning with compensating is called compensation learning and a systematic taxonomy is constructed for it in this study. In our taxonomy, compensation learning is divided on the basis of the compensation targets, directions, inference manners, and granularity levels. Many existing learning algorithms including some classical ones can be viewed or understood at least partially as compensation techniques. Furthermore, a family of new learning algorithms can be obtained by plugging the compensation learning into existing learning algorithms. Specifically, two concrete new learning algorithms are proposed for robust machine learning. Extensive experiments on image classification and text sentiment analysis verify the effectiveness of the two new algorithms. Compensation learning can also be used in other various learning scenarios, such as imbalance learning, clustering, regression, and so on.
    Common Misconceptions about Population Data. (arXiv:2112.10912v2 [cs.DB] UPDATED)
    (0 min) Databases covering all individuals of a population are increasingly used for research studies in domains ranging from public health to the social sciences. There is also growing interest by governments and businesses to use population data to support data-driven decision making. The massive size of such databases is often mistaken as a guarantee for valid inferences on the population of interest. However, population data have characteristics that make them challenging to use, including various assumptions being made how such data were collected and what types of processing have been applied to them. Furthermore, the full potential of population data can often only be unlocked when such data are linked to other databases, a process that adds fresh challenges. This article discusses a diverse range of misconceptions about population data that we believe anybody who works with such data needs to be aware of. Many of these misconceptions are not well documented in scientific publications but only discussed anecdotally among researchers and practitioners. We conclude with a set of recommendations for inference when using population data.
    Building a great multi-lingual teacher with sparsely-gated mixture of experts for speech recognition. (arXiv:2112.05820v3 [cs.CL] UPDATED)
    (0 min) The sparsely-gated Mixture of Experts (MoE) can magnify a network capacity with a little computational complexity. In this work, we investigate how multi-lingual Automatic Speech Recognition (ASR) networks can be scaled up with a simple routing algorithm in order to achieve better accuracy. More specifically, we apply the sparsely-gated MoE technique to two types of networks: Sequence-to-Sequence Transformer (S2S-T) and Transformer Transducer (T-T). We demonstrate through a set of ASR experiments on multiple language data that the MoE networks can reduce the relative word error rates by 16.3% and 4.6% with the S2S-T and T-T, respectively. Moreover, we thoroughly investigate the effect of the MoE on the T-T architecture in various conditions: streaming mode, non-streaming mode, the use of language ID and the label decoder with the MoE.
    Robust Semi-supervised Federated Learning for Images Automatic Recognition in Internet of Drones. (arXiv:2201.01230v1 [cs.LG])
    (0 min) Air access networks have been recognized as a significant driver of various Internet of Things (IoT) services and applications. In particular, the aerial computing network infrastructure centered on the Internet of Drones has set off a new revolution in automatic image recognition. This emerging technology relies on sharing ground truth labeled data between Unmanned Aerial Vehicle (UAV) swarms to train a high-quality automatic image recognition model. However, such an approach will bring data privacy and data availability challenges. To address these issues, we first present a Semi-supervised Federated Learning (SSFL) framework for privacy-preserving UAV image recognition. Specifically, we propose model parameters mixing strategy to improve the naive combination of FL and semi-supervised learning methods under two realistic scenarios (labels-at-client and labels-at-server), which is referred to as Federated Mixing (FedMix). Furthermore, there are significant differences in the number, features, and distribution of local data collected by UAVs using different camera modules in different environments, i.e., statistical heterogeneity. To alleviate the statistical heterogeneity problem, we propose an aggregation rule based on the frequency of the client's participation in training, namely the FedFreq aggregation rule, which can adjust the weight of the corresponding local model according to its frequency. Numerical results demonstrate that the performance of our proposed method is significantly better than those of the current baseline and is robust to different non-IID levels of client data.
    Two-level Graph Neural Network. (arXiv:2201.01190v1 [cs.LG])
    (0 min) Graph Neural Networks (GNNs) are recently proposed neural network structures for the processing of graph-structured data. Due to their employed neighbor aggregation strategy, existing GNNs focus on capturing node-level information and neglect high-level information. Existing GNNs therefore suffer from representational limitations caused by the Local Permutation Invariance (LPI) problem. To overcome these limitations and enrich the features captured by GNNs, we propose a novel GNN framework, referred to as the Two-level GNN (TL-GNN). This merges subgraph-level information with node-level information. Moreover, we provide a mathematical analysis of the LPI problem which demonstrates that subgraph-level information is beneficial to overcoming the problems associated with LPI. A subgraph counting method based on the dynamic programming algorithm is also proposed, and this has time complexity is O(n^3), n is the number of nodes of a graph. Experiments show that TL-GNN outperforms existing GNNs and achieves state-of-the-art performance.
    Combat Data Shift in Few-shot Learning with Knowledge Graph. (arXiv:2101.11354v3 [cs.LG] UPDATED)
    (0 min) Many few-shot learning approaches have been designed under the meta-learning framework, which learns from a variety of learning tasks and generalizes to new tasks. These meta-learning approaches achieve the expected performance in the scenario where all samples are drawn from the same distributions (i.i.d. observations). However, in real-world applications, few-shot learning paradigm often suffers from data shift, i.e., samples in different tasks, even in the same task, could be drawn from various data distributions. Most existing few-shot learning approaches are not designed with the consideration of data shift, and thus show downgraded performance when data distribution shifts. However, it is non-trivial to address the data shift problem in few-shot learning, due to the limited number of labeled samples in each task. Targeting at addressing this problem, we propose a novel metric-based meta-learning framework to extract task-specific representations and task-shared representations with the help of knowledge graph. The data shift within/between tasks can thus be combated by the combination of task-shared and task-specific representations. The proposed model is evaluated on popular benchmarks and two constructed new challenging datasets. The evaluation results demonstrate its remarkable performance.
    Exploring Architectural Ingredients of Adversarially Robust Deep Neural Networks. (arXiv:2110.03825v4 [cs.LG] UPDATED)
    (0 min) Deep neural networks (DNNs) are known to be vulnerable to adversarial attacks. A range of defense methods have been proposed to train adversarially robust DNNs, among which adversarial training has demonstrated promising results. However, despite preliminary understandings developed for adversarial training, it is still not clear, from the architectural perspective, what configurations can lead to more robust DNNs. In this paper, we address this gap via a comprehensive investigation on the impact of network width and depth on the robustness of adversarially trained DNNs. Specifically, we make the following key observations: 1) more parameters (higher model capacity) does not necessarily help adversarial robustness; 2) reducing capacity at the last stage (the last group of blocks) of the network can actually improve adversarial robustness; and 3) under the same parameter budget, there exists an optimal architectural configuration for adversarial robustness. We also provide a theoretical analysis explaning why such network configuration can help robustness. These architectural insights can help design adversarially robust DNNs. Code is available at \url{https://github.com/HanxunH/RobustWRN}.
    PluGeN: Multi-Label Conditional Generation From Pre-Trained Models. (arXiv:2109.09011v2 [cs.LG] UPDATED)
    (2 min) Modern generative models achieve excellent quality in a variety of tasks including image or text generation and chemical molecule modeling. However, existing methods often lack the essential ability to generate examples with requested properties, such as the age of the person in the photo or the weight of the generated molecule. Incorporating such additional conditioning factors would require rebuilding the entire architecture and optimizing the parameters from scratch. Moreover, it is difficult to disentangle selected attributes so that to perform edits of only one attribute while leaving the others unchanged. To overcome these limitations we propose PluGeN (Plugin Generative Network), a simple yet effective generative technique that can be used as a plugin to pre-trained generative models. The idea behind our approach is to transform the entangled latent representation using a flow-based module into a multi-dimensional space where the values of each attribute are modeled as an independent one-dimensional distribution. In consequence, PluGeN can generate new samples with desired attributes as well as manipulate labeled attributes of existing examples. Due to the disentangling of the latent representation, we are even able to generate samples with rare or unseen combinations of attributes in the dataset, such as a young person with gray hair, men with make-up, or women with beards. We combined PluGeN with GAN and VAE models and applied it to conditional generation and manipulation of images and chemical molecule modeling. Experiments demonstrate that PluGeN preserves the quality of backbone models while adding the ability to control the values of labeled attributes.
    Accelerated Zeroth-Order and First-Order Momentum Methods from Mini to Minimax Optimization. (arXiv:2008.08170v5 [math.OC] UPDATED)
    (0 min) In the paper, we propose a class of accelerated zeroth-order and first-order momentum methods for both nonconvex mini-optimization and minimax-optimization. Specifically, we propose a new accelerated zeroth-order momentum (Acc-ZOM) method for black-box mini-optimization. Moreover, we prove that our Acc-ZOM method achieves a lower query complexity of $\tilde{O}(d^{3/4}\epsilon^{-3})$ for finding an $\epsilon$-stationary point, which improves the best known result by a factor of $O(d^{1/4})$ where $d$ denotes the variable dimension. In particular, the Acc-ZOM does not require large batches required in the existing zeroth-order stochastic algorithms. Meanwhile, we propose an accelerated \textbf{zeroth-order} momentum descent ascent (Acc-ZOMDA) method for \textbf{black-box} minimax-optimization, which obtains a query complexity of $\tilde{O}((d_1+d_2)^{3/4}\kappa_y^{4.5}\epsilon^{-3})$ without large batches for finding an $\epsilon$-stationary point, where $d_1$ and $d_2$ denote variable dimensions and $\kappa_y$ is condition number. Moreover, we propose an accelerated \textbf{first-order} momentum descent ascent (Acc-MDA) method for \textbf{white-box} minimax optimization, which has a gradient complexity of $\tilde{O}(\kappa_y^{4.5}\epsilon^{-3})$ without large batches for finding an $\epsilon$-stationary point. In particular, our Acc-MDA can obtain a lower gradient complexity of $\tilde{O}(\kappa_y^{2.5}\epsilon^{-3})$ with a batch size $O(\kappa_y^4)$. Extensive experimental results on the black-box adversarial attack to deep neural networks (DNNs) and poisoning attack demonstrate efficiency of our algorithms.
    Learning to Generate Novel Classes for Deep Metric Learning. (arXiv:2201.01008v1 [cs.CV])
    (0 min) Deep metric learning aims to learn an embedding space where the distance between data reflects their class equivalence, even when their classes are unseen during training. However, the limited number of classes available in training precludes generalization of the learned embedding space. Motivated by this, we introduce a new data augmentation approach that synthesizes novel classes and their embedding vectors. Our approach can provide rich semantic information to an embedding model and improve its generalization by augmenting training data with novel classes unavailable in the original data. We implement this idea by learning and exploiting a conditional generative model, which, given a class label and a noise, produces a random embedding vector of the class. Our proposed generator allows the loss to use richer class relations by augmenting realistic and diverse classes, resulting in better generalization to unseen samples. Experimental results on public benchmark datasets demonstrate that our method clearly enhances the performance of proxy-based losses.
    Semi-relaxed Gromov-Wasserstein divergence with applications on graphs. (arXiv:2110.02753v2 [cs.LG] UPDATED)
    (0 min) Comparing structured objects such as graphs is a fundamental operation involved in many learning tasks. To this end, the Gromov-Wasserstein (GW) distance, based on Optimal Transport (OT), has proven to be successful in handling the specific nature of the associated objects. More specifically, through the nodes connectivity relations, GW operates on graphs, seen as probability measures over specific spaces. At the core of OT is the idea of conservation of mass, which imposes a coupling between all the nodes from the two considered graphs. We argue in this paper that this property can be detrimental for tasks such as graph dictionary or partition learning, and we relax it by proposing a new semi-relaxed Gromov-Wasserstein divergence. Aside from immediate computational benefits, we discuss its properties, and show that it can lead to an efficient graph dictionary learning algorithm. We empirically demonstrate its relevance for complex tasks on graphs such as partitioning, clustering and completion.
    Be Confident! Towards Trustworthy Graph Neural Networks via Confidence Calibration. (arXiv:2109.14285v3 [cs.LG] UPDATED)
    (0 min) Despite Graph Neural Networks (GNNs) have achieved remarkable accuracy, whether the results are trustworthy is still unexplored. Previous studies suggest that many modern neural networks are over-confident on the predictions, however, surprisingly, we discover that GNNs are primarily in the opposite direction, i.e., GNNs are under-confident. Therefore, the confidence calibration for GNNs is highly desired. In this paper, we propose a novel trustworthy GNN model by designing a topology-aware post-hoc calibration function. Specifically, we first verify that the confidence distribution in a graph has homophily property, and this finding inspires us to design a calibration GNN model (CaGCN) to learn the calibration function. CaGCN is able to obtain a unique transformation from logits of GNNs to the calibrated confidence for each node, meanwhile, such transformation is able to preserve the order between classes, satisfying the accuracy-preserving property. Moreover, we apply the calibration GNN to self-training framework, showing that more trustworthy pseudo labels can be obtained with the calibrated confidence and further improve the performance. Extensive experiments demonstrate the effectiveness of our proposed model in terms of both calibration and accuracy.
    Learning to Play No-Press Diplomacy with Best Response Policy Iteration. (arXiv:2006.04635v4 [cs.LG] UPDATED)
    (2 min) Recent advances in deep reinforcement learning (RL) have led to considerable progress in many 2-player zero-sum games, such as Go, Poker and Starcraft. The purely adversarial nature of such games allows for conceptually simple and principled application of RL methods. However real-world settings are many-agent, and agent interactions are complex mixtures of common-interest and competitive aspects. We consider Diplomacy, a 7-player board game designed to accentuate dilemmas resulting from many-agent interactions. It also features a large combinatorial action space and simultaneous moves, which are challenging for RL algorithms. We propose a simple yet effective approximate best response operator, designed to handle large combinatorial action spaces and simultaneous moves. We also introduce a family of policy iteration methods that approximate fictitious play. With these methods, we successfully apply RL to Diplomacy: we show that our agents convincingly outperform the previous state-of-the-art, and game theoretic equilibrium analysis shows that the new process yields consistent improvements.
    A Survey On Universal Adversarial Attack. (arXiv:2103.01498v2 [cs.LG] UPDATED)
    (2 min) The intriguing phenomenon of adversarial examples has attracted significant attention in machine learning and what might be more surprising to the community is the existence of universal adversarial perturbations (UAPs), i.e. a single perturbation to fool the target DNN for most images. With the focus on UAP against deep classifiers, this survey summarizes the recent progress on universal adversarial attacks, discussing the challenges from both the attack and defense sides, as well as the reason for the existence of UAP. We aim to extend this work as a dynamic survey that will regularly update its content to follow new works regarding UAP or universal attack in a wide range of domains, such as image, audio, video, text, etc. Relevant updates will be discussed at: https://bit.ly/2SbQlLG. We welcome authors of future works in this field to contact us for including your new finding.
    On the effectiveness of adversarial training against common corruptions. (arXiv:2103.02325v2 [cs.LG] UPDATED)
    (2 min) The literature on robustness towards common corruptions shows no consensus on whether adversarial training can improve the performance in this setting. First, we show that, when used with an appropriately selected perturbation radius, $\ell_p$ adversarial training can serve as a strong baseline against common corruptions improving both accuracy and calibration. Then we explain why adversarial training performs better than data augmentation with simple Gaussian noise which has been observed to be a meaningful baseline on common corruptions. Related to this, we identify the $\sigma$-overfitting phenomenon when Gaussian augmentation overfits to a particular standard deviation used for training which has a significant detrimental effect on common corruption accuracy. We discuss how to alleviate this problem and then how to further enhance $\ell_p$ adversarial training by introducing an efficient relaxation of adversarial training with learned perceptual image patch similarity as the distance metric. Through experiments on CIFAR-10 and ImageNet-100, we show that our approach does not only improve the $\ell_p$ adversarial training baseline but also has cumulative gains with data augmentation methods such as AugMix, DeepAugment, ANT, and SIN, leading to state-of-the-art performance on common corruptions. The code of our experiments is publicly available at https://github.com/tml-epfl/adv-training-corruptions.
    3DVSR: 3D EPI Volume-based Approach for Angular and Spatial Light field Image Super-resolution. (arXiv:2201.01294v1 [cs.CV])
    (2 min) Light field (LF) imaging, which captures both spatial and angular information of a scene, is undoubtedly beneficial to numerous applications. Although various techniques have been proposed for LF acquisition, achieving both angularly and spatially high-resolution LF remains a technology challenge. In this paper, a learning-based approach applied to 3D epipolar image (EPI) is proposed to reconstruct high-resolution LF. Through a 2-stage super-resolution framework, the proposed approach effectively addresses various LF super-resolution (SR) problems, i.e., spatial SR, angular SR, and angular-spatial SR. While the first stage provides flexible options to up-sample EPI volume to the desired resolution, the second stage, which consists of a novel EPI volume-based refinement network (EVRN), substantially enhances the quality of the high-resolution EPI volume. An extensive evaluation on 90 challenging synthetic and real-world light field scenes from 7 published datasets shows that the proposed approach outperforms state-of-the-art methods to a large extend for both spatial and angular super-resolution problem, i.e., an average peak signal to noise ratio improvement of more than 2.0 dB, 1.4 dB, and 3.14 dB in spatial SR $\times 2$, spatial SR $\times 4$, and angular SR respectively. The reconstructed 4D light field demonstrates a balanced performance distribution across all perspective images and presents superior visual quality compared to the previous works.
    Predicting Influenza A Viral Host Using PSSM and Word Embeddings. (arXiv:2201.01140v1 [cs.CL])
    (2 min) The rapid mutation of the influenza virus threatens public health. Reassortment among viruses with different hosts can lead to a fatal pandemic. However, it is difficult to detect the original host of the virus during or after an outbreak as influenza viruses can circulate between different species. Therefore, early and rapid detection of the viral host would help reduce the further spread of the virus. We use various machine learning models with features derived from the position-specific scoring matrix (PSSM) and features learned from word embedding and word encoding to infer the origin host of viruses. The results show that the performance of the PSSM-based model reaches the MCC around 95%, and the F1 around 96%. The MCC obtained using the model with word embedding is around 96%, and the F1 is around 97%.
    Visual Microfossil Identification via Deep Metric Learning. (arXiv:2112.09490v2 [cs.CV] UPDATED)
    (0 min) We apply deep metric learning for the first time to the prob-lem of classifying planktic foraminifer shells on microscopic images. This species recognition task is an important information source and scientific pillar for reconstructing past climates. All foraminifer CNN recognition pipelines in the literature produce black-box classifiers that lack visualisation options for human experts and cannot be applied to open set problems. Here, we benchmark metric learning against these pipelines, produce the first scientific visualisation of the phenotypic planktic foraminifer morphology space, and demonstrate that metric learning can be used to cluster species unseen during training. We show that metric learning out-performs all published CNN-based state-of-the-art benchmarks in this domain. We evaluate our approach on the 34,640 expert-annotated images of the Endless Forams public library of 35 modern planktic foraminifera species. Our results on this data show leading 92% accuracy (at 0.84 F1-score) in reproducing expert labels on withheld test data, and 66.5% accuracy (at 0.70 F1-score) when clustering species never encountered in training. We conclude that metric learning is highly effective for this domain and serves as an important tool towards expert-in-the-loop automation of microfossil identification. Key code, network weights, and data splits are published with this paper for full reproducibility.
    ExAID: A Multimodal Explanation Framework for Computer-Aided Diagnosis of Skin Lesions. (arXiv:2201.01249v1 [cs.AI])
    (2 min) One principal impediment in the successful deployment of AI-based Computer-Aided Diagnosis (CAD) systems in clinical workflows is their lack of transparent decision making. Although commonly used eXplainable AI methods provide some insight into opaque algorithms, such explanations are usually convoluted and not readily comprehensible except by highly trained experts. The explanation of decisions regarding the malignancy of skin lesions from dermoscopic images demands particular clarity, as the underlying medical problem definition is itself ambiguous. This work presents ExAID (Explainable AI for Dermatology), a novel framework for biomedical image analysis, providing multi-modal concept-based explanations consisting of easy-to-understand textual explanations supplemented by visual maps justifying the predictions. ExAID relies on Concept Activation Vectors to map human concepts to those learnt by arbitrary Deep Learning models in latent space, and Concept Localization Maps to highlight concepts in the input space. This identification of relevant concepts is then used to construct fine-grained textual explanations supplemented by concept-wise location information to provide comprehensive and coherent multi-modal explanations. All information is comprehensively presented in a diagnostic interface for use in clinical routines. An educational mode provides dataset-level explanation statistics and tools for data and model exploration to aid medical research and education. Through rigorous quantitative and qualitative evaluation of ExAID, we show the utility of multi-modal explanations for CAD-assisted scenarios even in case of wrong predictions. We believe that ExAID will provide dermatologists an effective screening tool that they both understand and trust. Moreover, it will be the basis for similar applications in other biomedical imaging fields.
    Towards a Rigorous Evaluation of Time-series Anomaly Detection. (arXiv:2109.05257v2 [cs.LG] UPDATED)
    (2 min) In recent years, proposed studies on time-series anomaly detection (TAD) report high F1 scores on benchmark TAD datasets, giving the impression of clear improvements in TAD. However, most studies apply a peculiar evaluation protocol called point adjustment (PA) before scoring. In this paper, we theoretically and experimentally reveal that the PA protocol has a great possibility of overestimating the detection performance; that is, even a random anomaly score can easily turn into a state-of-the-art TAD method. Therefore, the comparison of TAD methods after applying the PA protocol can lead to misguided rankings. Furthermore, we question the potential of existing TAD methods by showing that an untrained model obtains comparable detection performance to the existing methods even when PA is forbidden. Based on our findings, we propose a new baseline and an evaluation protocol. We expect that our study will help a rigorous evaluation of TAD and lead to further improvement in future researches.
    Graph Neural Networks: a bibliometrics overview. (arXiv:2201.01188v1 [cs.LG])
    (2 min) Recently, graph neural networks have become a hot topic in machine learning community. This paper presents a Scopus based bibliometric overview of the GNNs research since 2004, when GNN papers were first published. The study aims to evaluate GNN research trend, both quantitatively and qualitatively. We provide the trend of research, distribution of subjects, active and influential authors and institutions, sources of publications, most cited documents, and hot topics. Our investigations reveal that the most frequent subject categories in this field are computer science, engineering, telecommunications, linguistics, operations research and management science, information science and library science, business and economics, automation and control systems, robotics, and social sciences. In addition, the most active source of GNN publications is Lecture Notes in Computer Science. The most prolific or impactful institutions are found in the United States, China, and Canada. We also provide must read papers and future directions. Finally, the application of graph convolutional networks and attention mechanism are now among hot topics of GNN research.
    The cluster structure function. (arXiv:2201.01222v1 [cs.LG])
    (2 min) For each partition of a data set into a given number of parts there is a partition such that every part is as much as possible a good model (an "algorithmic sufficient statistic") for the data in that part. Since this can be done for every number between one and the number of data, the result is a function, the cluster structure function. It maps the number of parts of a partition to values related to the deficiencies of being good models by the parts. Such a function starts with a value at least zero for no partition of the data set and descents to zero for the partition of the data set into singleton parts. The optimal clustering is the one chosen to minimize the cluster structure function. The theory behind the method is expressed in algorithmic information theory (Kolmogorov complexity). In practice the Kolmogorov complexities involved are approximated by a concrete compressor. We give examples using real data sets: the MNIST handwritten digits and the segmentation of real cells as used in stem cell research.
    AutoBalance: Optimized Loss Functions for Imbalanced Data. (arXiv:2201.01212v1 [cs.LG])
    (2 min) Imbalanced datasets are commonplace in modern machine learning problems. The presence of under-represented classes or groups with sensitive attributes results in concerns about generalization and fairness. Such concerns are further exacerbated by the fact that large capacity deep nets can perfectly fit the training data and appear to achieve perfect accuracy and fairness during training, but perform poorly during test. To address these challenges, we propose AutoBalance, a bi-level optimization framework that automatically designs a training loss function to optimize a blend of accuracy and fairness-seeking objectives. Specifically, a lower-level problem trains the model weights, and an upper-level problem tunes the loss function by monitoring and optimizing the desired objective over the validation data. Our loss design enables personalized treatment for classes/groups by employing a parametric cross-entropy loss and individualized data augmentation schemes. We evaluate the benefits and performance of our approach for the application scenarios of imbalanced and group-sensitive classification. Extensive empirical evaluations demonstrate the benefits of AutoBalance over state-of-the-art approaches. Our experimental findings are complemented with theoretical insights on loss function design and the benefits of train-validation split. All code is available open-source.
    Sequoia: A Software Framework to Unify Continual Learning Research. (arXiv:2108.01005v3 [cs.LG] UPDATED)
    (2 min) The field of Continual Learning (CL) seeks to develop algorithms that accumulate knowledge and skills over time through interaction with non-stationary environments. In practice, a plethora of evaluation procedures (settings) and algorithmic solutions (methods) exist, each with their own potentially disjoint set of assumptions. This variety makes measuring progress in CL difficult. We propose a taxonomy of settings, where each setting is described as a set of assumptions. A tree-shaped hierarchy emerges from this view, where more general settings become the parents of those with more restrictive assumptions. This makes it possible to use inheritance to share and reuse research, as developing a method for a given setting also makes it directly applicable onto any of its children. We instantiate this idea as a publicly available software framework called Sequoia, which features a wide variety of settings from both the Continual Supervised Learning (CSL) and Continual Reinforcement Learning (CRL) domains. Sequoia also includes a growing suite of methods which are easy to extend and customize, in addition to more specialized methods from external libraries. We hope that this new paradigm and its first implementation can help unify and accelerate research in CL. You can help us grow the tree by visiting github.com/lebrice/Sequoia.
    Neural Actor: Neural Free-view Synthesis of Human Actors with Pose Control. (arXiv:2106.02019v2 [cs.CV] UPDATED)
    (0 min) We propose Neural Actor (NA), a new method for high-quality synthesis of humans from arbitrary viewpoints and under arbitrary controllable poses. Our method is built upon recent neural scene representation and rendering works which learn representations of geometry and appearance from only 2D images. While existing works demonstrated compelling rendering of static scenes and playback of dynamic scenes, photo-realistic reconstruction and rendering of humans with neural implicit methods, in particular under user-controlled novel poses, is still difficult. To address this problem, we utilize a coarse body model as the proxy to unwarp the surrounding 3D space into a canonical pose. A neural radiance field learns pose-dependent geometric deformations and pose- and view-dependent appearance effects in the canonical space from multi-view video input. To synthesize novel views of high fidelity dynamic geometry and appearance, we leverage 2D texture maps defined on the body model as latent variables for predicting residual deformations and the dynamic appearance. Experiments demonstrate that our method achieves better quality than the state-of-the-arts on playback as well as novel pose synthesis, and can even generalize well to new poses that starkly differ from the training poses. Furthermore, our method also supports body shape control of the synthesized results.
    Monitoring and Anomaly Detection Actor-Critic Based Controlled Sensing. (arXiv:2201.00879v1 [cs.LG])
    (2 min) We address the problem of monitoring a set of binary stochastic processes and generating an alert when the number of anomalies among them exceeds a threshold. For this, the decision-maker selects and probes a subset of the processes to obtain noisy estimates of their states (normal or anomalous). Based on the received observations, the decisionmaker first determines whether to declare that the number of anomalies has exceeded the threshold or to continue taking observations. When the decision is to continue, it then decides whether to collect observations at the next time instant or defer it to a later time. If it chooses to collect observations, it further determines the subset of processes to be probed. To devise this three-step sequential decision-making process, we use a Bayesian formulation wherein we learn the posterior probability on the states of the processes. Using the posterior probability, we construct a Markov decision process and solve it using deep actor-critic reinforcement learning. Via numerical experiments, we demonstrate the superior performance of our algorithm compared to the traditional model-based algorithms.
    Variance-Aware Off-Policy Evaluation with Linear Function Approximation. (arXiv:2106.11960v2 [cs.LG] UPDATED)
    (2 min) We study the off-policy evaluation (OPE) problem in reinforcement learning with linear function approximation, which aims to estimate the value function of a target policy based on the offline data collected by a behavior policy. We propose to incorporate the variance information of the value function to improve the sample efficiency of OPE. More specifically, for time-inhomogeneous episodic linear Markov decision processes (MDPs), we propose an algorithm, VA-OPE, which uses the estimated variance of the value function to reweight the Bellman residual in Fitted Q-Iteration. We show that our algorithm achieves a tighter error bound than the best-known result. We also provide a fine-grained characterization of the distribution shift between the behavior policy and the target policy. Extensive numerical experiments corroborate our theory.
    Neural Piecewise-Constant Delay Differential Equations. (arXiv:2201.00960v1 [cs.LG])
    (2 min) Continuous-depth neural networks, such as the Neural Ordinary Differential Equations (ODEs), have aroused a great deal of interest from the communities of machine learning and data science in recent years, which bridge the connection between deep neural networks and dynamical systems. In this article, we introduce a new sort of continuous-depth neural network, called the Neural Piecewise-Constant Delay Differential Equations (PCDDEs). Here, unlike the recently proposed framework of the Neural Delay Differential Equations (DDEs), we transform the single delay into the piecewise-constant delay(s). The Neural PCDDEs with such a transformation, on one hand, inherit the strength of universal approximating capability in Neural DDEs. On the other hand, the Neural PCDDEs, leveraging the contributions of the information from the multiple previous time steps, further promote the modeling capability without augmenting the network dimension. With such a promotion, we show that the Neural PCDDEs do outperform the several existing continuous-depth neural frameworks on the one-dimensional piecewise-constant delay population dynamics and real-world datasets, including MNIST, CIFAR10, and SVHN.
    Evolutionary Multitasking AUC Optimization. (arXiv:2201.01145v1 [cs.LG])
    (2 min) Learning to optimize the area under the receiver operating characteristics curve (AUC) performance for imbalanced data has attracted much attention in recent years. Although there have been several methods of AUC optimization, scaling up AUC optimization is still an open issue due to its pairwise learning style. Maximizing AUC in the large-scale dataset can be considered as a non-convex and expensive problem. Inspired by the characteristic of pairwise learning, the cheap AUC optimization task with a small-scale dataset sampled from the large-scale dataset is constructed to promote the AUC accuracy of the original, large-scale, and expensive AUC optimization task. This paper develops an evolutionary multitasking framework (termed EMTAUC) to make full use of information among the constructed cheap and expensive tasks to obtain higher performance. In EMTAUC, one mission is to optimize AUC from the sampled dataset, and the other is to maximize AUC from the original dataset. Moreover, due to the cheap task containing limited knowledge, a strategy for dynamically adjusting the data structure of inexpensive tasks is proposed to introduce more knowledge into the multitasking AUC optimization environment. The performance of the proposed method is evaluated on a series of binary classification datasets. The experimental results demonstrate that EMTAUC is highly competitive to single task methods and online methods. Supplementary materials and source code implementation of EMTAUC can be accessed at https://github.com/xiaofangxd/EMTAUC.
    Supervised Homogeneity Fusion: a Combinatorial Approach. (arXiv:2201.01036v1 [stat.ML])
    (2 min) Fusing regression coefficients into homogenous groups can unveil those coefficients that share a common value within each group. Such groupwise homogeneity reduces the intrinsic dimension of the parameter space and unleashes sharper statistical accuracy. We propose and investigate a new combinatorial grouping approach called $L_0$-Fusion that is amenable to mixed integer optimization (MIO). On the statistical aspect, we identify a fundamental quantity called grouping sensitivity that underpins the difficulty of recovering the true groups. We show that $L_0$-Fusion achieves grouping consistency under the weakest possible requirement of the grouping sensitivity: if this requirement is violated, then the minimax risk of group misspecification will fail to converge to zero. Moreover, we show that in the high-dimensional regime, one can apply $L_0$-Fusion coupled with a sure screening set of features without any essential loss of statistical efficiency, while reducing the computational cost substantially. On the algorithmic aspect, we provide a MIO formulation for $L_0$-Fusion along with a warm start strategy. Simulation and real data analysis demonstrate that $L_0$-Fusion exhibits superiority over its competitors in terms of grouping accuracy.
    Neural Myerson Auction for Truthful and Energy-Efficient Autonomous Aerial Data Delivery. (arXiv:2201.01170v1 [cs.GT])
    (2 min) A successful deployment of drones provides an ideal solution for surveillance systems. Using drones for surveillance can provide access to areas that may be difficult or impossible to reach by humans or in-land vehicles gathering images or video recordings of a specific target in their coverage. Therefore, we introduces a data delivery drone to transfer collected surveillance data in harsh communication conditions. This paper proposes a Myerson auction-based asynchronous data delivery in an aerial distributed data platform in surveillance systems taking battery limitation and long flight constraints into account. In this paper, multiple delivery drones compete to offer data transfer to a single fixed-location surveillance drone. Our proposed Myerson auction-based algorithm, which uses the truthful second-price auction (SPA) as a baseline, is to maximize the seller's revenue while meeting several desirable properties, i.e., individual rationality and incentive compatibility while pursuing truthful operations. On top of these SPA-based operations, a deep learning-based framework is additionally designed for delivery performance improvements.
    An unfeasability view of neural network learning. (arXiv:2201.00945v1 [cs.LG])
    (2 min) We define the notion of a continuously differentiable perfect learning algorithm for multilayer neural network architectures and show that such algorithms don't exist provided that the length of the data set exceeds the number of involved parameters and the activation functions are logistic, tanh or sin.
    Uniform Convergence of Interpolators: Gaussian Width, Norm Bounds, and Benign Overfitting. (arXiv:2106.09276v2 [stat.ML] UPDATED)
    (2 min) We consider interpolation learning in high-dimensional linear regression with Gaussian data, and prove a generic uniform convergence guarantee on the generalization error of interpolators in an arbitrary hypothesis class in terms of the class's Gaussian width. Applying the generic bound to Euclidean norm balls recovers the consistency result of Bartlett et al. (2020) for minimum-norm interpolators, and confirms a prediction of Zhou et al. (2020) for near-minimal-norm interpolators in the special case of Gaussian data. We demonstrate the generality of the bound by applying it to the simplex, obtaining a novel consistency result for minimum l1-norm interpolators (basis pursuit). Our results show how norm-based generalization bounds can explain and be used to analyze benign overfitting, at least in some settings.
    Deep neural networks for smooth approximation of physics with higher order and continuity B-spline base functions. (arXiv:2201.00904v1 [math.NA])
    (2 min) This paper deals with the following important research question. Traditionally, the neural network employs non-linear activation functions concatenated with linear operators to approximate a given physical phenomenon. They "fill the space" with the concatenations of the activation functions and linear operators and adjust their coefficients to approximate the physical phenomena. We claim that it is better to "fill the space" with linear combinations of smooth higher-order B-splines base functions as employed by isogeometric analysis and utilize the neural networks to adjust the coefficients of linear combinations. In other words, the possibilities of using neural networks for approximating the B-spline base functions' coefficients and by approximating the solution directly are evaluated. Solving differential equations with neural networks has been proposed by Maziar Raissi et al. in 2017 by introducing Physics-informed Neural Networks (PINN), which naturally encode underlying physical laws as prior information. Approximation of coefficients using a function as an input leverages the well-known capability of neural networks being universal function approximators. In essence, in the PINN approach the network approximates the value of the given field at a given point. We present an alternative approach, where the physcial quantity is approximated as a linear combination of smooth B-spline basis functions, and the neural network approximates the coefficients of B-splines. This research compares results from the DNN approximating the coefficients of the linear combination of B-spline basis functions, with the DNN approximating the solution directly. We show that our approach is cheaper and more accurate when approximating smooth physical fields.
    Convolutional Normalization: Improving Deep Convolutional Network Robustness and Training. (arXiv:2103.00673v2 [cs.CV] UPDATED)
    (2 min) Normalization techniques have become a basic component in modern convolutional neural networks (ConvNets). In particular, many recent works demonstrate that promoting the orthogonality of the weights helps train deep models and improve robustness. For ConvNets, most existing methods are based on penalizing or normalizing weight matrices derived from concatenating or flattening the convolutional kernels. These methods often destroy or ignore the benign convolutional structure of the kernels; therefore, they are often expensive or impractical for deep ConvNets. In contrast, we introduce a simple and efficient "Convolutional Normalization" (ConvNorm) method that can fully exploit the convolutional structure in the Fourier domain and serve as a simple plug-and-play module to be conveniently incorporated into any ConvNets. Our method is inspired by recent work on preconditioning methods for convolutional sparse coding and can effectively promote each layer's channel-wise isometry. Furthermore, we show that our ConvNorm can reduce the layerwise spectral norm of the weight matrices and hence improve the Lipschitzness of the network, leading to easier training and improved robustness for deep ConvNets. Applied to classification under noise corruptions and generative adversarial network (GAN), we show that the ConvNorm improves the robustness of common ConvNets such as ResNet and the performance of GAN. We verify our findings via numerical experiments on CIFAR and ImageNet.
    APReL: A Library for Active Preference-based Reward Learning Algorithms. (arXiv:2108.07259v2 [cs.LG] UPDATED)
    (2 min) Reward learning is a fundamental problem in human-robot interaction to have robots that operate in alignment with what their human user wants. Many preference-based learning algorithms and active querying techniques have been proposed as a solution to this problem. In this paper, we present APReL, a library for active preference-based reward learning algorithms, which enable researchers and practitioners to experiment with the existing techniques and easily develop their own algorithms for various modules of the problem. APReL is available at https://github.com/Stanford-ILIAD/APReL.
    Deriving discriminative classifiers from generative models. (arXiv:2201.00844v1 [stat.ML])
    (2 min) We deal with Bayesian generative and discriminative classifiers. Given a model distribution $p(x, y)$, with the observation $y$ and the target $x$, one computes generative classifiers by firstly considering $p(x, y)$ and then using the Bayes rule to calculate $p(x | y)$. A discriminative model is directly given by $p(x | y)$, which is used to compute discriminative classifiers. However, recent works showed that the Bayesian Maximum Posterior classifier defined from the Naive Bayes (NB) or Hidden Markov Chain (HMC), both generative models, can also match the discriminative classifier definition. Thus, there are situations in which dividing classifiers into "generative" and "discriminative" is somewhat misleading. Indeed, such a distinction is rather related to the way of computing classifiers, not to the classifiers themselves. We present a general theoretical result specifying how a generative classifier induced from a generative model can also be computed in a discriminative way from the same model. Examples of NB and HMC are found again as particular cases, and we apply the general result to two original extensions of NB, and two extensions of HMC, one of which being original. Finally, we shortly illustrate the interest of the new discriminative way of computing classifiers in the Natural Language Processing (NLP) framework.
    Low dosage 3D volume fluorescence microscopy imaging using compressive sensing. (arXiv:2201.00820v1 [eess.IV])
    (2 min) Fluorescence microscopy has been a significant tool to observe long-term imaging of embryos (in vivo) growth over time. However, cumulative exposure is phototoxic to such sensitive live samples. While techniques like light-sheet fluorescence microscopy (LSFM) allow for reduced exposure, it is not well suited for deep imaging models. Other computational techniques are computationally expensive and often lack restoration quality. To address this challenge, one can use various low-dosage imaging techniques that are developed to achieve the 3D volume reconstruction using a few slices in the axial direction (z-axis); however, they often lack restoration quality. Also, acquiring dense images (with small steps) in the axial direction is computationally expensive. To address this challenge, we present a compressive sensing (CS) based approach to fully reconstruct 3D volumes with the same signal-to-noise ratio (SNR) with less than half of the excitation dosage. We present the theory and experimentally validate the approach. To demonstrate our technique, we capture a 3D volume of the RFP labeled neurons in the zebrafish embryo spinal cord (30um thickness) with the axial sampling of 0.1um using a confocal microscope. From the results, we observe the CS-based approach achieves accurate 3D volume reconstruction from less than 20% of the entire stack optical sections. The developed CS-based methodology in this work can be easily applied to other deep imaging modalities such as two-photon and light-sheet microscopy, where reducing sample photo-toxicity is a critical challenge.
    Generating synthetic mobility data for a realistic population with RNNs to improve utility and privacy. (arXiv:2201.01139v1 [cs.LG])
    (2 min) Location data collected from mobile devices represent mobility behaviors at individual and societal levels. These data have important applications ranging from transportation planning to epidemic modeling. However, issues must be overcome to best serve these use cases: The data often represent a limited sample of the population and use of the data jeopardizes privacy. To address these issues, we present and evaluate a system for generating synthetic mobility data using a deep recurrent neural network (RNN) which is trained on real location data. The system takes a population distribution as input and generates mobility traces for a corresponding synthetic population. Related generative approaches have not solved the challenges of capturing both the patterns and variability in individuals' mobility behaviors over longer time periods, while also balancing the generation of realistic data with privacy. Our system leverages RNNs' ability to generate complex and novel sequences while retaining patterns from training data. Also, the model introduces randomness used to calibrate the variation between the synthetic and real data at the individual level. This is to both capture variability in human mobility, and protect user privacy. Location based services (LBS) data from more than 22,700 mobile devices were used in an experimental evaluation across utility and privacy metrics. We show the generated mobility data retain the characteristics of the real data, while varying from the real data at the individual level, and where this amount of variation matches the variation within the real data.
  • stat.ML updates on arXiv.org

    Exploring Architectural Ingredients of Adversarially Robust Deep Neural Networks. (arXiv:2110.03825v4 [cs.LG] UPDATED)
    (2 min) Deep neural networks (DNNs) are known to be vulnerable to adversarial attacks. A range of defense methods have been proposed to train adversarially robust DNNs, among which adversarial training has demonstrated promising results. However, despite preliminary understandings developed for adversarial training, it is still not clear, from the architectural perspective, what configurations can lead to more robust DNNs. In this paper, we address this gap via a comprehensive investigation on the impact of network width and depth on the robustness of adversarially trained DNNs. Specifically, we make the following key observations: 1) more parameters (higher model capacity) does not necessarily help adversarial robustness; 2) reducing capacity at the last stage (the last group of blocks) of the network can actually improve adversarial robustness; and 3) under the same parameter budget, there exists an optimal architectural configuration for adversarial robustness. We also provide a theoretical analysis explaning why such network configuration can help robustness. These architectural insights can help design adversarially robust DNNs. Code is available at \url{https://github.com/HanxunH/RobustWRN}.
    Uniform Convergence of Interpolators: Gaussian Width, Norm Bounds, and Benign Overfitting. (arXiv:2106.09276v2 [stat.ML] UPDATED)
    (2 min) We consider interpolation learning in high-dimensional linear regression with Gaussian data, and prove a generic uniform convergence guarantee on the generalization error of interpolators in an arbitrary hypothesis class in terms of the class's Gaussian width. Applying the generic bound to Euclidean norm balls recovers the consistency result of Bartlett et al. (2020) for minimum-norm interpolators, and confirms a prediction of Zhou et al. (2020) for near-minimal-norm interpolators in the special case of Gaussian data. We demonstrate the generality of the bound by applying it to the simplex, obtaining a novel consistency result for minimum l1-norm interpolators (basis pursuit). Our results show how norm-based generalization bounds can explain and be used to analyze benign overfitting, at least in some settings.
    Fast Approximation of the Sliced-Wasserstein Distance Using Concentration of Random Projections. (arXiv:2106.15427v2 [stat.ML] UPDATED)
    (2 min) The Sliced-Wasserstein distance (SW) is being increasingly used in machine learning applications as an alternative to the Wasserstein distance and offers significant computational and statistical benefits. Since it is defined as an expectation over random projections, SW is commonly approximated by Monte Carlo. We adopt a new perspective to approximate SW by making use of the concentration of measure phenomenon: under mild assumptions, one-dimensional projections of a high-dimensional random vector are approximately Gaussian. Based on this observation, we develop a simple deterministic approximation for SW. Our method does not require sampling a number of random projections, and is therefore both accurate and easy to use compared to the usual Monte Carlo approximation. We derive nonasymptotical guarantees for our approach, and show that the approximation error goes to zero as the dimension increases, under a weak dependence condition on the data distribution. We validate our theoretical findings on synthetic datasets, and illustrate the proposed approximation on a generative modeling problem.
    Discovering Diverse Nearly Optimal Policies with Successor Features. (arXiv:2106.00669v2 [cs.AI] UPDATED)
    (2 min) Finding different solutions to the same problem is a key aspect of intelligence associated with creativity and adaptation to novel situations. In reinforcement learning, a set of diverse policies can be useful for exploration, transfer, hierarchy, and robustness. We propose Diverse Successive Policies, a method for discovering policies that are diverse in the space of Successor Features, while assuring that they are near optimal. We formalize the problem as a Constrained Markov Decision Process (CMDP) where the goal is to find policies that maximize diversity, characterized by an intrinsic diversity reward, while remaining near-optimal with respect to the extrinsic reward of the MDP. We also analyze how recently proposed robustness and discrimination rewards perform and find that they are sensitive to the initialization of the procedure and may converge to sub-optimal solutions. To alleviate this, we propose new explicit diversity rewards that aim to minimize the correlation between the Successor Features of the policies in the set. We compare the different diversity mechanisms in the DeepMind Control Suite and find that the type of explicit diversity we are proposing is important to discover distinct behavior, like for example different locomotion patterns.
    Variance-Aware Off-Policy Evaluation with Linear Function Approximation. (arXiv:2106.11960v2 [cs.LG] UPDATED)
    (2 min) We study the off-policy evaluation (OPE) problem in reinforcement learning with linear function approximation, which aims to estimate the value function of a target policy based on the offline data collected by a behavior policy. We propose to incorporate the variance information of the value function to improve the sample efficiency of OPE. More specifically, for time-inhomogeneous episodic linear Markov decision processes (MDPs), we propose an algorithm, VA-OPE, which uses the estimated variance of the value function to reweight the Bellman residual in Fitted Q-Iteration. We show that our algorithm achieves a tighter error bound than the best-known result. We also provide a fine-grained characterization of the distribution shift between the behavior policy and the target policy. Extensive numerical experiments corroborate our theory.
    Supervised Homogeneity Fusion: a Combinatorial Approach. (arXiv:2201.01036v1 [stat.ML])
    (2 min) Fusing regression coefficients into homogenous groups can unveil those coefficients that share a common value within each group. Such groupwise homogeneity reduces the intrinsic dimension of the parameter space and unleashes sharper statistical accuracy. We propose and investigate a new combinatorial grouping approach called $L_0$-Fusion that is amenable to mixed integer optimization (MIO). On the statistical aspect, we identify a fundamental quantity called grouping sensitivity that underpins the difficulty of recovering the true groups. We show that $L_0$-Fusion achieves grouping consistency under the weakest possible requirement of the grouping sensitivity: if this requirement is violated, then the minimax risk of group misspecification will fail to converge to zero. Moreover, we show that in the high-dimensional regime, one can apply $L_0$-Fusion coupled with a sure screening set of features without any essential loss of statistical efficiency, while reducing the computational cost substantially. On the algorithmic aspect, we provide a MIO formulation for $L_0$-Fusion along with a warm start strategy. Simulation and real data analysis demonstrate that $L_0$-Fusion exhibits superiority over its competitors in terms of grouping accuracy.
    Self-supervised representation learning from 12-lead ECG data. (arXiv:2103.12676v2 [eess.SP] UPDATED)
    (2 min) Clinical 12-lead electrocardiography (ECG) is one of the most widely encountered kinds of biosignals. Despite the increased availability of public ECG datasets, label scarcity remains a central challenge in the field. Self-supervised learning represents a promising way to alleviate this issue. In this work, we put forward the first comprehensive assessment of self-supervised representation learning from clinical 12-lead ECG data. To this end, we adapt state-of-the-art self-supervised methods based on instance discrimination and latent forecasting to the ECG domain. In a first step, we learn contrastive representations and evaluate their quality based on linear evaluation performance on a recently established, comprehensive, clinical ECG classification task. In a second step, we analyze the impact of self-supervised pretraining on finetuned ECG classifiers as compared to purely supervised performance. For the best-performing method, an adaptation of contrastive predictive coding, we find a linear evaluation performance only 0.5% below supervised performance. For the finetuned models, we find improvements in downstream performance of roughly 1% compared to supervised performance, label efficiency, as well as robustness against physiological noise. This work clearly establishes the feasibility of extracting discriminative representations from ECG data via self-supervised learning and the numerous advantages when finetuning such representations on downstream tasks as compared to purely supervised training. As first comprehensive assessment of its kind in the ECG domain carried out exclusively on publicly available datasets, we hope to establish a first step towards reproducible progress in the rapidly evolving field of representation learning for biosignals.
    Kendall transformation: a robust representation of continuous data for information theory. (arXiv:2006.15991v2 [cs.LG] UPDATED)
    (2 min) Kendall transformation is a conversion of an ordered feature into a vector of pairwise order relations between individual values. This way, it preserves ranking of observations and represents it in a categorical form. Such transformation allows for generalisation of methods requiring strictly categorical input, especially in the limit of small number of observations, when discretisation becomes problematic. In particular, many approaches of information theory can be directly applied to Kendall-transformed continuous data without relying on differential entropy or any additional parameters. Moreover, by filtering information to this contained in ranking, Kendall transformation leads to a better robustness at a reasonable cost of dropping sophisticated interactions which are anyhow unlikely to be correctly estimated. In bivariate analysis, Kendall transformation can be related to popular non-parametric methods, showing the soundness of the approach. The paper also demonstrates its efficiency in multivariate problems, as well as provides an example analysis of a real-world data.
    Deriving discriminative classifiers from generative models. (arXiv:2201.00844v1 [stat.ML])
    (2 min) We deal with Bayesian generative and discriminative classifiers. Given a model distribution $p(x, y)$, with the observation $y$ and the target $x$, one computes generative classifiers by firstly considering $p(x, y)$ and then using the Bayes rule to calculate $p(x | y)$. A discriminative model is directly given by $p(x | y)$, which is used to compute discriminative classifiers. However, recent works showed that the Bayesian Maximum Posterior classifier defined from the Naive Bayes (NB) or Hidden Markov Chain (HMC), both generative models, can also match the discriminative classifier definition. Thus, there are situations in which dividing classifiers into "generative" and "discriminative" is somewhat misleading. Indeed, such a distinction is rather related to the way of computing classifiers, not to the classifiers themselves. We present a general theoretical result specifying how a generative classifier induced from a generative model can also be computed in a discriminative way from the same model. Examples of NB and HMC are found again as particular cases, and we apply the general result to two original extensions of NB, and two extensions of HMC, one of which being original. Finally, we shortly illustrate the interest of the new discriminative way of computing classifiers in the Natural Language Processing (NLP) framework.
    Open Access Dataset for Electromyography based Multi-code Biometric Authentication. (arXiv:2201.01051v1 [cs.CR])
    (2 min) Recently, surface electromyogram (EMG) has been proposed as a novel biometric trait for addressing some key limitations of current biometrics, such as spoofing and liveness. The EMG signals possess a unique characteristic: they are inherently different for individuals (biometrics), and they can be customized to realize multi-length codes or passwords (for example, by performing different gestures). However, current EMG-based biometric research has two critical limitations: 1) a small subject pool, compared to other more established biometric traits, and 2) limited to single-session or single-day data sets. In this study, forearm and wrist EMG data were collected from 43 participants over three different days with long separation while they performed static hand and wrist gestures. The multi-day biometric authentication resulted in a median EER of 0.017 for the forearm setup and 0.025 for the wrist setup, comparable to well-established biometric traits suggesting consistent performance over multiple days. The presented large-sample multi-day data set and findings could facilitate further research on EMG-based biometrics and other gesture recognition-based applications.
    Neural Estimation of Statistical Divergences. (arXiv:2110.03652v2 [math.ST] UPDATED)
    (2 min) Statistical divergences (SDs), which quantify the dissimilarity between probability distributions, are a basic constituent of statistical inference and machine learning. A modern method for estimating those divergences relies on parametrizing an empirical variational form by a neural network (NN) and optimizing over parameter space. Such neural estimators are abundantly used in practice, but corresponding performance guarantees are partial and call for further exploration. In particular, there is a fundamental tradeoff between the two sources of error involved: approximation and empirical estimation. While the former needs the NN class to be rich and expressive, the latter relies on controlling complexity. We explore this tradeoff for an estimator based on a shallow NN by means of non-asymptotic error bounds, focusing on four popular $\mathsf{f}$-divergences -- Kullback-Leibler, chi-squared, squared Hellinger, and total variation. Our analysis relies on non-asymptotic function approximation theorems and tools from empirical process theory. The bounds reveal the tension between the NN size and the number of samples, and enable to characterize scaling rates thereof that ensure consistency. For compactly supported distributions, we further show that neural estimators of the first three divergences above with appropriate NN growth-rate are near minimax rate-optimal, achieving the parametric rate up to logarithmic factors.
    Optimal design of the Barker proposal and other locally-balanced Metropolis-Hastings algorithms. (arXiv:2201.01123v1 [stat.CO])
    (2 min) We study the class of first-order locally-balanced Metropolis--Hastings algorithms introduced in Livingstone & Zanella (2021). To choose a specific algorithm within the class the user must select a balancing function $g:\mathbb{R} \to \mathbb{R}$ satisfying $g(t) = tg(1/t)$, and a noise distribution for the proposal increment. Popular choices within the class are the Metropolis-adjusted Langevin algorithm and the recently introduced Barker proposal. We first establish a universal limiting optimal acceptance rate of 57% and scaling of $n^{-1/3}$ as the dimension $n$ tends to infinity among all members of the class under mild smoothness assumptions on $g$ and when the target distribution for the algorithm is of the product form. In particular we obtain an explicit expression for the asymptotic efficiency of an arbitrary algorithm in the class, as measured by expected squared jumping distance. We then consider how to optimise this expression under various constraints. We derive an optimal choice of noise distribution for the Barker proposal, optimal choice of balancing function under a Gaussian noise distribution, and optimal choice of first-order locally-balanced algorithm among the entire class, which turns out to depend on the specific target distribution. Numerical simulations confirm our theoretical findings and in particular show that a bi-modal choice of noise distribution in the Barker proposal gives rise to a practical algorithm that is consistently more efficient than the original Gaussian version.
    Learning to Play No-Press Diplomacy with Best Response Policy Iteration. (arXiv:2006.04635v4 [cs.LG] UPDATED)
    (2 min) Recent advances in deep reinforcement learning (RL) have led to considerable progress in many 2-player zero-sum games, such as Go, Poker and Starcraft. The purely adversarial nature of such games allows for conceptually simple and principled application of RL methods. However real-world settings are many-agent, and agent interactions are complex mixtures of common-interest and competitive aspects. We consider Diplomacy, a 7-player board game designed to accentuate dilemmas resulting from many-agent interactions. It also features a large combinatorial action space and simultaneous moves, which are challenging for RL algorithms. We propose a simple yet effective approximate best response operator, designed to handle large combinatorial action spaces and simultaneous moves. We also introduce a family of policy iteration methods that approximate fictitious play. With these methods, we successfully apply RL to Diplomacy: we show that our agents convincingly outperform the previous state-of-the-art, and game theoretic equilibrium analysis shows that the new process yields consistent improvements.
    Descriptive vs. inferential community detection: pitfalls, myths and half-truths. (arXiv:2112.00183v3 [physics.soc-ph] UPDATED)
    (2 min) Community detection is one of the most important methodological fields of network science, and one which has attracted a significant amount of attention over the past decades. This area deals with the automated division of a network into fundamental building blocks, with the objective of providing a summary of its large-scale structure. Despite its importance and widespread adoption, there is a noticeable gap between what is considered the state-of-the-art and the methods that are actually used in practice in a variety of fields. Here we attempt to address this discrepancy by dividing existing methods according to whether they have a "descriptive" or an "inferential" goal. While descriptive methods find patterns in networks based on intuitive notions of community structure, inferential methods articulate a precise generative model, and attempt to fit it to data. In this way, they are able to provide insights into the mechanisms of network formation, and separate structure from randomness in a manner supported by statistical evidence. We review how employing descriptive methods with inferential aims is riddled with pitfalls and misleading answers, and thus should be in general avoided. We argue that inferential methods are more typically aligned with clearer scientific questions, yield more robust results, and should be in many cases preferred. We attempt to dispel some myths and half-truths often believed when community detection is employed in practice, in an effort to improve both the use of such methods as well as the interpretation of their results.
    On the effectiveness of adversarial training against common corruptions. (arXiv:2103.02325v2 [cs.LG] UPDATED)
    (2 min) The literature on robustness towards common corruptions shows no consensus on whether adversarial training can improve the performance in this setting. First, we show that, when used with an appropriately selected perturbation radius, $\ell_p$ adversarial training can serve as a strong baseline against common corruptions improving both accuracy and calibration. Then we explain why adversarial training performs better than data augmentation with simple Gaussian noise which has been observed to be a meaningful baseline on common corruptions. Related to this, we identify the $\sigma$-overfitting phenomenon when Gaussian augmentation overfits to a particular standard deviation used for training which has a significant detrimental effect on common corruption accuracy. We discuss how to alleviate this problem and then how to further enhance $\ell_p$ adversarial training by introducing an efficient relaxation of adversarial training with learned perceptual image patch similarity as the distance metric. Through experiments on CIFAR-10 and ImageNet-100, we show that our approach does not only improve the $\ell_p$ adversarial training baseline but also has cumulative gains with data augmentation methods such as AugMix, DeepAugment, ANT, and SIN, leading to state-of-the-art performance on common corruptions. The code of our experiments is publicly available at https://github.com/tml-epfl/adv-training-corruptions.
    Biased Hypothesis Formation From Projection Pursuit. (arXiv:2201.00889v1 [cs.LG])
    (2 min) The effect of bias on hypothesis formation is characterized for an automated data-driven projection pursuit neural network to extract and select features for binary classification of data streams. This intelligent exploratory process partitions a complete vector state space into disjoint subspaces to create working hypotheses quantified by similarities and differences observed between two groups of labeled data streams. Data streams are typically time sequenced, and may exhibit complex spatio-temporal patterns. For example, given atomic trajectories from molecular dynamics simulation, the machine's task is to quantify dynamical mechanisms that promote function by comparing protein mutants, some known to function while others are nonfunctional. Utilizing synthetic two-dimensional molecules that mimic the dynamics of functional and nonfunctional proteins, biases are identified and controlled in both the machine learning model and selected training data under different contexts. The refinement of a working hypothesis converges to a statistically robust multivariate perception of the data based on a context-dependent perspective. Including diverse perspectives during data exploration enhances interpretability of the multivariate characterization of similarities and differences.
  • The Official NVIDIA Blog

    Prepare for Genshin Impact, Coming to GeForce NOW in Limited Beta
    (4 min) GeForce NOW is charging into the new year at full force. This GFN Thursday comes with the news that Genshin Impact, the popular open-world action role-playing game, will be coming to the cloud this year, arriving in a limited beta. Plus, this year’s CES announcements were packed with news for GeForce NOW. Battlefield 4: Premium Read article > The post Prepare for Genshin Impact, Coming to GeForce NOW in Limited Beta appeared first on The Official NVIDIA Blog.
  • Blog

    Python debugging tools
    (16 min) In all programming exercises, it is difficult to go far and deep without a handy debugger. The built-in debugger, pdb, […] The post Python debugging tools appeared first on Machine Learning Mastery.
  • Artificial Intelligence

    Meta AI and CMU Researchers Present ‘BANMo’: A New Neural Network-Based Method To Build Animatable 3D Models From Videos
    (1 min) Previous work on articulated 3D shape reconstruction has frequently relied on specialized sensors (e.g., synchronized multi-camera systems) or pre-built 3D deformable models (e.g., SMAL or SMPL). Such approaches cannot scale to a wide range of items in the wild. BANMo is a technique that does not need a specialized sensor or a pre-defined template form. In a differentiable rendering framework, BANMo generates high-fidelity, articulated 3D models (including state and animatable skinning weights) from a large number of monocular casual films. While the usage of several films increases coverage of camera perspectives and object articulations, it introduces significant issues in establishing correlation across scenes with diverse backdrops, lighting conditions, etc. Continue Reading Paper: https://arxiv.org/pdf/2112.12761.pdf Project: https://banmo-www.github.io/ ​ https://reddit.com/link/rxlq17/video/7a0w2v9u44a81/player submitted by /u/ai-lover [link] [comments]
    [R] University of Amsterdam & Meta AI Propose a Roadmap Toward Interactive Language Modelling Based on Caregiver-Child Interactions
    (1 min) In the new paper Towards Interactive Language Modeling, a research team from the University of Amsterdam and Meta AI Labs presents a road map detailing the steps to be taken towards interactive language modelling. Here is a quick read: University of Amsterdam & Meta AI Propose a Roadmap Toward Interactive Language Modelling Based on Caregiver-Child Interactions. The paper Towards Interactive Language Modelling is on arXiv. submitted by /u/Yuqing7 [link] [comments]
    What is DeepMind's Gopher?
    (0 min) submitted by /u/Beautiful-Credit-868 [link] [comments]
    Reinforcement learning for the real world
    (0 min) submitted by /u/bendee983 [link] [comments]
    SaneBox | Clean up your inbox in minutes & keep it that way forever
    (0 min) submitted by /u/belqassim [link] [comments]
    Last Week in AI - AI enables brain interface for robot control, Deep Learning suffers from overinterpretation, and more!
    (0 min) submitted by /u/regalalgorithm [link] [comments]
    Where can I find AIs which are able to edit images automatically?
    (1 min) Like filters or even better. submitted by /u/xXLisa28Xx [link] [comments]
    We Are Miracles Made of the Cosmic Sea of Miracles
    (1 min) submitted by /u/sanguineon [link] [comments]
    State of art in Knowledge representation of image and text pair in AI model
    (1 min) If there is a training data which consists of pair of image and corresponding textual explanation of that image. How to represent this Semantic knowledge between image and corresponding textual meaning in AI model. So that we don't need much of training data, if I can infuse that knowledge between image and text pair. Idea is to infuse this knowledge and then at inference time, if such image appears then perform action. Ex- If there is a image and respective meaning is 'open the door', then if robot sees that image at inference it will open the door. I am thinking of NLP based model to infuse this knowledge, but not sure how BERT like model can be employed in this situation. Is there existing SOTA which addresses this kind of issues? submitted by /u/projekt_treadstone [link] [comments]
    Researchers From Stanford and NVIDIA Introduce A Tri-Plane-Based 3D GAN Framework To Enable High-Resolution Geometry-Aware Image Synthesis
    (1 min) Generative Adversarial Networks (GANs) have been one of the main hypes of recent years. Based on the famous generator-discriminator mechanism, their very simple functioning has driven the research to continuously improve the former architecture. The peak in image generation has been reached by StyleGANs, which can produce astonishingly realistic and high-quality images, able to fool even humans. While the generation of new samples has achieved excellent results in the 2D domain, 3D GANs are still highly inefficient. If the exact mechanism of 2D GANs is applied in the 3D environment, the computational effort is too high since 3D data is tough to manipulate for current GPUs. For this reason, the research has focused on how to construct geometry-aware GANs that can infer the underline 3D property using solely 2D images. But, in this case, the approximations are usually not 3D consistent. Continue Reading The Paper Summary Paper: https://arxiv.org/pdf/2112.07945.pdf Project: https://matthew-a-chan.github.io/EG3D/ ​ https://reddit.com/link/rx5bpe/video/bdca5nd9uz981/player submitted by /u/ai-lover [link] [comments]
  • Machine Learning

    [D] Turning petabyte-scale video data into great datasets
    (1 min) Imagine 10 cameras, running 24/7 at 30 FPS - that's 27 million frames generated in a single day. Knowing what's in that data or finding the 1% of things that are actually interesting is hard. I genuinely think there's way too little attention put forth to how people should use their raw data effectively even though so many people choose to store petabytes of it, just in case. I wrote a little article about taking raw video and turning that into an actionable computer vision model. Would love to have a technical discussion about this so comment away :) https://towardsdatascience.com/curating-a-dataset-from-raw-images-and-videos-c8b962eca9ba P.S. Yes, I'm also the OP of the post showcasing Sieve from a few days ago! submitted by /u/happybirthday290 [link] [comments]
    [Project] Guidance on Key Information Extraction for financial statements
    (2 min) Hello all, I'm currently working on a project where I'm trying to perform key information extraction (KIE) and classification on rental property operating statements. My plan is to pre-train a model on a similar dataset (e.g. [IBM FinTabNet](https://developer.ibm.com/exchanges/data/all/fintabnet/), etc.) and then fine-tune on a custom dataset I'm creating. My task is to: Identify the key:value pairs associated with each line item for each period in the statement. So for example, for the first line item, we'd have a 3-tuple of ('Rent income', 'Nov 2020', 21,428.03) Identify the key:value pairs associated with the total for each line item in the statement. So for example, for the first line item, we'd have a 3-tuple of ('Rent income', 'Total', 322,872.36) Automatically classify line it…
    [D] Sparsest Cut in practice?
    (1 min) Sparsest cut, and its close variant "balanced cut", have been studied for ages in the computer science theory community, and there are algorithms to compute this which run fast. Sparsest cut is often billed as a "graph clustering" algorithm. Sparsest cut gives rise to an obvious data clustering algorithm: build a graph on your data (like kNN or threshold graph), and run sparsest cut to get a 2-way partition. To do a k-way partition, you could recursively do this. From what I know, sparsest cut intuition is one thing that guided the creation of spectral clustering, a widely used clustering method. HHas anyone tried implementing this in practice? If so, how does it do on data? If not, why not? I have searched extensively on Google for any implementations of this sparsest-cut style clustering, but I haven't found any. Given that sparsest cut is one of the most well-studied problems in computer theory, I was surprised that no implementation of this clustering method exists. submitted by /u/Grand_Distribution83 [link] [comments]
    [D] Any measurement studies about MLaaS monetisation
    (1 min) ML as a Service appears to be the new hype, but I cant quite figure out how such technologies are monetised. GPT-3 appears to have attracted quite a lot of clients, but is it profitable at all? What do the incentives look like it this market? Are model extraction attacks a problem here at all? How does one protect their IP? I heard of people watermarking their models, but I am not sure how such techniques in their current form would fare in court. submitted by /u/SuchOccasion457 [link] [comments]
    [D] Yolov5 TTA slower on smaller image
    (1 min) Hi all, I'm using a Yolov5 model to make inferences and noticed that it was faster at inferring 320x320 images (~0.25s/image) compared to 288x288 images (~0.45s/image). When I disabled test time augmentation, the gap closed, and both image sizes are inferred at ~0.084s/image. Any ideas why augmenting a smaller image size is slower than augmenting a faster image size? submitted by /u/queue_learning [link] [comments]
    [N] MDLI Ops – A free conference to help you make sense of the MLOps landscape
    (2 min) Disclaimer: Not my conference, but a good friend's work. He's doing an awesome job community building and I thought this might interest the community here as well (I am not a sponsor). Hey r/ML, some of you might have heard about MDLI – short for Machine & Deep Learning Israel. It's an independent Israeli community (over 25K members) for professionals in data science and ML, and they are having a conference for everyone (in English), in just 14 days. You can register here https://machinelearning.co.il/mdli-ops-2022/ I know that sometimes these free events tend to feel like commercials, but you should really check out the agenda. Some of the talks are going to be by ML teams at companies like AppsFlyer and BigPanda, explaining how they built their internal stacks and ML systems. There's also a super interesting talk by the NVIDIA team, where they will talk about building supercomputers to train ML models on them. This one seems crazy awesome to me. It's also a good opportunity to listen to offerings by some cool ML startups, which might give you a better understanding of how they compare, and what you actually care about when choosing MLOps tools. The organizers will probably be here in the comments if you have any questions, but in my opinion, every event this community organizes is really awesome, and I get to learn a lot, so I really recommend this. submitted by /u/PhYsIcS-GUY227 [link] [comments]
    [P] Machine Learning Engineering Conferences
    (1 min) I was wondering if anyone could recommend and conferences suitable for MLE’s, maybe with a focus on deploying models. I’ve been to traditional data science focused one, such as the Anaconda one. Though I wanted to see if any existed that were more closely connected to deployment. submitted by /u/MenArePigs69 [link] [comments]
    [P] Deepchecks: an open-source tool for high standards validations for ML models and data.
    (2 min) Hey everyone! I wanted to share with you an open-source tool we've been building for a while. Deepchecks is an open-source tool for validating & testing models and data efficiently: https://github.com/deepchecks/deepchecks Deepchecks is a python package, implementing validations and tests needed in order to trust an ML pipeline. It contains many built-in checks, such as verifying the data integrity, inspecting its distributions, validating data splits, evaluating your model, and comparing between different models. In addition, it contains test suites, similar to the test suites in software programs, that can accompany you through all building blocks of the ML pipeline development. Each test suite contains checks necessary for the specific part in the pipeline. The suite result looks something like this: ​ Suite result The suites and checks have a simple syntax and are highly customizable. If you want to jump right in, you can try it out in the quick start notebook: https://docs.deepchecks.com/en/stable/examples/guides/quickstart_in_5_minutes.html What do you think? I’ll be happy to hear your thoughts and feedback. submitted by /u/EuphoricMeal8344 [link] [comments]
    Relationship extraction for knowledge graph creation from biomedical literature
    (0 min) submitted by /u/dreadknight011 [link] [comments]
    [D] Retrieval transformers for other domains?
    (1 min) I am just going through the research behind this: http://jalammar.github.io/illustrated-retrieval-transformer/ and was wondering if there has been work like this for other domains say computer vision for example? submitted by /u/data-drone [link] [comments]
    [D] How to correctly use batch norm in pre-loaded architectures?
    (1 min) I have tried using a few pre-loaded architectures in Tensorflow including tf.keras.applications.efficientnet.EfficientNetB3 and tf.keras.applications.MobileNetV3Large . I am working on medical images so I am not any pretrained ImageNet weights. I manage to reach a decent accuracy during the training epoch (eg 75%), but my validation accuracy doesn't cross random performance (eg 3%). Upon closer inspection I have identified the cause as being the batch normalization layers - If I run the evaluation batches w/ training=True on the BatchNorm layers, I am able to reproduce the training set accuracy. I have tried playing with the momentum parameter and changing it from its default value of .999 to .75 or even .5 but with no effect. My question is what may be causing this and how can I fix it? On a larger note, do most people find that they can use pre-loaded architectures out-of-the-box or are there various parameters that they need to modify to get it to work properly? submitted by /u/rsandler [link] [comments]
  • Neural Networks, Deep Learning and Machine Learning

    Can a neural net be trained to create badass artwork from pinterest?
    (2 min) Example: If i trained a neural net on my own pinterest board full of manga/illustration style artwork - with a rough theme (fantasy/sci-fi/wizard type characters) could the neural net then be able to output random variations of these? If no - what would it take? Sorry if im thinking about this the wrong way, i'm still new to this. submitted by /u/skittleteeth [link] [comments]
    Is this the right use of an ANN?
    (1 min) I have made rudimentary ANN before but keep looking for ways to push my boundaries. I like AI and flight sims so why not an AI that helps me flight sim. :) I'm thinking of an ANN assistant (eg co-pilot) that learns to do key actions during flight. Inputs: On ground (Boolean) Altitude (numerical - normalised) Pitch (nose up/down/level) Outputs: Gear (up/down) Autopilot (on/off) Flaps (up/down) Lights (on/off) So, given the state of the aircraft at any given point, the ANN can manipulate (or recommend) the configuration of the aircraft and lighten the pilot work load (remembering this is a fun exercise). Assume my ANN can monitor and manipulate the flight sim (it has an interface to do this), is an ANN with appropriate weights and nodes a good fit (after training)? Should I be looking at Q-Table or something else? submitted by /u/Togfox [link] [comments]
    3dCNN question help appreciated
    (1 min) Im new to 3dCNN. Is it possible to use 3dCNN for action classification (like sitting, walking?) I have temperature data in csv this data can be turned into 16x16x4 images (RBG to show temp in different colors). My question is, do i need to turn each frame into an image? To then be turned into a video in order to feed it into the 3dCNN? If so, how can i optimize this? 1 action has several frames in the same csv file, and I have several files. Or can i just use temperature data in csv straight away? (Each file contains one action with several frames worth of temp data) Any help /links/ examples are greatly appreciated. submitted by /u/Miki_mallow [link] [comments]
  • Reinforcement Learning

    Help with my PyTorch implementation of PPO
    (1 min) Hi all, I implemented PPO using PyTorch here. As is suggested, I was trying it on very simple environment (CartPole-v1). My PPO doesn't seem to be very impressive. The reward seems to saturate somewhere around 25, while the best it can go is around 190s. Here is benchmark of all my experiments. There exist various implementations with various set of hyperparameters, and other improvement choices (Advantage using GAE, Clip critic loss, and so on). I have tried various mix of them, but sadly nothing seems to improve it. The various resources I have referred are: Phil's video: https://youtu.be/hlv79rcHws0video by wandb: https://youtu.be/MEt6rrxH8W4Hyperparameters Deep Reinforcement Learning that Matters: https://arxiv.org/pdf/1709.06560.pdf IMPLEMENTATION MATTERS IN DEEP POLICY GRADIENTS: A CASE STUDY ON PPO AND TRPO: https://openreview.net/attachment?id=r1etN1rtPB&name=original_pdf Can you help me getting it trained on this simple environment.I have implemented it in a way that it should be easy to run and understand. submitted by /u/harshraj22 [link] [comments]
    Normalizing advantage estimates in PPO
    (1 min) Hello everybody! One implementation detail of PPO is to normalize the advantage estimates. These are usually normalized using the mean and std so that the mean is 0 and the std is 1. Now, I came across 3 different implementations that apply the normalization to different amounts of data. More precisely: Normalizing advantages across single episodes (kamarl) Normalizing advantages across single mini batches (neroRL) Normalizing advantages across the entire batch (SpinningUp) Do you guys think that this detail has a higher impact? My first intuition is that the more data is used for normalization, the smoother the normalized advantages are. So when normalizing across single episodes, this could lead to a high variance causing harmfully large parameter updates, whereas the smooth normalization could hinder learning. submitted by /u/LilHairdy [link] [comments]
    Would you fine tune the hyperparameters or modify the reward shaping (customized env) to improve the performance of model-free algorithms like TD3?
    (1 min) submitted by /u/Rich_Beautiful [link] [comments]
    Is Reinforcement Learning just a special case of causal modeling?
    (6 min) Hi everyone - I am a data scientist who works a lot on causal inference problems. In particular, I build individual treatment effect (AKA Conditional Average Treatment Effects AKA Uplift) models to estimate the causal effect of some potential treatment(s) on an outcome of interest for an individual for the express purpose of giving the optimal "treatment" to a given individual. "Optimal" usually means the treatment that is estimated to cause the largest positive effect on the outcome. A typical example is optimizing marketing touches for individual customers: we have several "treatments" we can apply to a customer at a given time, such as targeting them for an online ad campaign, sending a push notification through our app, sending a marketing email, etc. and we want to make sure we are pu…
    Analyzing Skill Expression in Games (with the help of reinforcement learning)
    (17 min) submitted by /u/thechiamp [link] [comments]

2022-01-05

  • cs.LG updates on arXiv.org

    Identify comorbidities associated with recurrent ED and in-patient visits. (arXiv:2110.13769v2 [stat.ML] UPDATED)
    (0 min) In the hospital setting, a small percentage of recurrent frequent patients contribute to a disproportional amount of healthcare resource usage. Moreover, in many of these cases, patient outcomes can be greatly improved by reducing reoccurring visits, especially when they are associated with substance abuse, mental health, and medical factors that could be improved by social-behavioral interventions, outpatient or preventative care. Additionally, health care costs can be reduced significantly with fewer preventable recurrent visits. To address this, we developed a computationally efficient and interpretable framework that both identifies recurrent patients with high utilization and determines which comorbidities contribute most to their recurrent visits. Specifically, we present a novel algorithm, called the minimum similarity association rules (MSAR), balancing confidence-support trade-off, to determine the conditions most associated with reoccurring Emergency department (ED) and inpatient visits. We validate MSAR on a large Electric Health Record (EHR) dataset.
    GCN-Geo: A Graph Convolution Network-based Fine-grained IP Geolocation System. (arXiv:2112.10767v4 [cs.LG] UPDATED)
    (0 min) Fine-grained IP geolocation systems often rely on some linear delay-distance rules. They are not easy to generalize in network environments where the delay-distance relationship is non-linear. Recently, researchers begin to pay attention to learning-based IP geolocation systems. These data-driven algorithms leverage multi-layer perceptron (MLP) to model non-linear relationships. However, MLP is not so suitable for modeling computer networks because networks are fundamentally graph-typed data. MLP-based IP geolocation systems only treat IP addresses as isolated data instances, forgoing the connection information between IP addresses. This would lead to sub-optimal representations and limit the geolocation performance. Graph convolutional network (GCN) is an emerging deep learning method for graph-typed data presentation. In this work, we research how to model computer networks for fine-grained IP geolocation with GCN. First, we formulate the IP geolocation task as an attributed graph node regression problem. Then, a GCN-based IP geolocation system named GCN-Geo is proposed to predict the location of each IP address. GCN-Geo consists of a preprocessor, an encoder, graph convolutional (GC) layers and a decoder. The preprocessor and the encoder transform raw measurement data into the initial graph embeddings. GC layers refine the initial graph node embeddings by explicitly modeling the connection information between IP addresses. The proposed decoder can relieve the converging problem of GCN-Geo by considering some prior knowledge about target IP addresses. Finally, the experimental results in three real-world datasets show that: GCN-Geo clearly outperforms the state-of-art rule-based and learning-based baselines on all three datasets w.r.t. average, median and max error distances. This work verifies the potential of GCN in fine-grained IP geolocation.
    A Comprehensive Survey of Logging in Software: From Logging Statements Automation to Log Mining and Analysis. (arXiv:2110.12489v2 [cs.SE] UPDATED)
    (0 min) Logs are widely used to record runtime information of software systems, such as the timestamp and the importance of an event, the unique ID of the source of the log, and a part of the state of a task's execution. The rich information of logs enables system developers (and operators) to monitor the runtime behaviors of their systems and further track down system problems and perform analysis on log data in production settings. However, the prior research on utilizing logs is scattered and that limits the ability of new researchers in this field to quickly get to the speed and hampers currently active researchers to advance this field further. Therefore, this paper surveys and provides a systematic literature review and mapping of the contemporary logging practices and log statements' mining and monitoring techniques and their applications such as in system failure detection and diagnosis. We study a large number of conference and journal papers that appeared on top-level peer-reviewed venues. Additionally, we draw high-level trends of ongoing research and categorize publications into subdivisions. In the end, and based on our holistic observations during this survey, we provide a set of challenges and opportunities that will lead the researchers in academia and industry in moving the field forward.
    Fracture Detection in Wrist X-ray Images Using Deep Learning-Based Object Detection Models. (arXiv:2111.07355v2 [eess.IV] UPDATED)
    (0 min) Wrist fractures are common cases in hospitals, particularly in emergency services. Physicians need images from various medical devices, and patients medical history and physical examination to diagnose these fractures correctly and apply proper treatment. This study aims to perform fracture detection using deep learning on wrist Xray images to assist physicians not specialized in the field, working in emergency services in particular, in diagnosis of fractures. For this purpose, 20 different detection procedures were performed using deep learning based object detection models on dataset of wrist Xray images obtained from Gazi University Hospital. DCN, Dynamic R_CNN, Faster R_CNN, FSAF, Libra R_CNN, PAA, RetinaNet, RegNet and SABL deep learning based object detection models with various backbones were used herein. To further improve detection procedures in the study, 5 different ensemble models were developed, which were later used to reform an ensemble model to develop a detection model unique to our study, titled wrist fracture detection combo (WFD_C). Based on 26 different models for fracture detection, the highest result of detection was 0.8639 average precision (AP50) in WFD_C model developed. This study is supported by Huawei Turkey R&D Center within the scope of the ongoing cooperation project coded 071813 among Gazi University, Huawei and Medskor.
    SoK: Privacy-preserving Deep Learning with Homomorphic Encryption. (arXiv:2112.12855v2 [cs.CR] UPDATED)
    (0 min) Outsourced computation for neural networks allows users access to state of the art models without needing to invest in specialized hardware and know-how. The problem is that the users lose control over potentially privacy sensitive data. With homomorphic encryption (HE) computation can be performed on encrypted data without revealing its content. In this systematization of knowledge, we take an in-depth look at approaches that combine neural networks with HE for privacy preservation. We categorize the changes to neural network models and architectures to make them computable over HE and how these changes impact performance. We find numerous challenges to HE based privacy-preserving deep learning such as computational overhead, usability, and limitations posed by the encryption schemes.
    Mimicking the Oracle: An Initial Phase Decorrelation Approach for Class Incremental Learning. (arXiv:2112.04731v3 [cs.CV] UPDATED)
    (0 min) Class Incremental Learning (CIL) aims at learning a multi-class classifier in a phase-by-phase manner, in which only data of a subset of the classes are provided at each phase. Previous works mainly focus on mitigating forgetting in phases after the initial one. However, we find that improving CIL at its initial phase is also a promising direction. Specifically, we experimentally show that directly encouraging CIL Learner at the initial phase to output similar representations as the model jointly trained on all classes can greatly boost the CIL performance. Motivated by this, we study the difference between a na\"ively-trained initial-phase model and the oracle model. Specifically, since one major difference between these two models is the number of training classes, we investigate how such difference affects the model representations. We find that, with fewer training classes, the data representations of each class lie in a long and narrow region; with more training classes, the representations of each class scatter more uniformly. Inspired by this observation, we propose Class-wise Decorrelation (CwD) that effectively regularizes representations of each class to scatter more uniformly, thus mimicking the model jointly trained with all classes (i.e., the oracle model). Our CwD is simple to implement and easy to plug into existing methods. Extensive experiments on various benchmark datasets show that CwD consistently and significantly improves the performance of existing state-of-the-art methods by around 1\% to 3\%. Code will be released.
    DSelect-k: Differentiable Selection in the Mixture of Experts with Applications to Multi-Task Learning. (arXiv:2106.03760v3 [cs.LG] UPDATED)
    (0 min) The Mixture-of-Experts (MoE) architecture is showing promising results in improving parameter sharing in multi-task learning (MTL) and in scaling high-capacity neural networks. State-of-the-art MoE models use a trainable sparse gate to select a subset of the experts for each input example. While conceptually appealing, existing sparse gates, such as Top-k, are not smooth. The lack of smoothness can lead to convergence and statistical performance issues when training with gradient-based methods. In this paper, we develop DSelect-k: a continuously differentiable and sparse gate for MoE, based on a novel binary encoding formulation. The gate can be trained using first-order methods, such as stochastic gradient descent, and offers explicit control over the number of experts to select. We demonstrate the effectiveness of DSelect-k on both synthetic and real MTL datasets with up to $128$ tasks. Our experiments indicate that DSelect-k can achieve statistically significant improvements in prediction and expert selection over popular MoE gates. Notably, on a real-world, large-scale recommender system, DSelect-k achieves over $22\%$ improvement in predictive performance compared to Top-k. We provide an open-source implementation of DSelect-k.
    On the Provable Generalization of Recurrent Neural Networks. (arXiv:2109.14142v3 [cs.LG] UPDATED)
    (0 min) Recurrent Neural Network (RNN) is a fundamental structure in deep learning. Recently, some works study the training process of over-parameterized neural networks, and show that over-parameterized networks can learn functions in some notable concept classes with a provable generalization error bound. In this paper, we analyze the training and generalization for RNNs with random initialization, and provide the following improvements over recent works: 1) For a RNN with input sequence $x=(X_1,X_2,...,X_L)$, previous works study to learn functions that are summation of $f(\beta^T_lX_l)$ and require normalized conditions that $||X_l||\leq\epsilon$ with some very small $\epsilon$ depending on the complexity of $f$. In this paper, using detailed analysis about the neural tangent kernel matrix, we prove a generalization error bound to learn such functions without normalized conditions and show that some notable concept classes are learnable with the numbers of iterations and samples scaling almost-polynomially in the input length $L$. 2) Moreover, we prove a novel result to learn N-variables functions of input sequence with the form $f(\beta^T[X_{l_1},...,X_{l_N}])$, which do not belong to the "additive" concept class, i,e., the summation of function $f(X_l)$. And we show that when either $N$ or $l_0=\max(l_1,..,l_N)-\min(l_1,..,l_N)$ is small, $f(\beta^T[X_{l_1},...,X_{l_N}])$ will be learnable with the number iterations and samples scaling almost-polynomially in the input length $L$.
    Layer-wise Adaptive Model Aggregation for Scalable Federated Learning. (arXiv:2110.10302v3 [cs.LG] UPDATED)
    (0 min) In Federated Learning, a common approach for aggregating local models across clients is periodic averaging of the full model parameters. It is, however, known that different layers of neural networks can have a different degree of model discrepancy across the clients. The conventional full aggregation scheme does not consider such a difference and synchronizes the whole model parameters at once, resulting in inefficient network bandwidth consumption. Aggregating the parameters that are similar across the clients does not make meaningful training progress while increasing the communication cost. We propose FedLAMA, a layer-wise model aggregation scheme for scalable Federated Learning. FedLAMA adaptively adjusts the aggregation interval in a layer-wise manner, jointly considering the model discrepancy and the communication cost. The layer-wise aggregation method enables to finely control the aggregation interval to relax the aggregation frequency without a significant impact on the model accuracy. Our empirical study shows that FedLAMA reduces the communication cost by up to 60% for IID data and 70% for non-IID data while achieving a comparable accuracy to FedAvg.
    On Optimizing Interventions in Shared Autonomy. (arXiv:2112.09169v2 [cs.AI] UPDATED)
    (0 min) Shared autonomy refers to approaches for enabling an autonomous agent to collaborate with a human with the aim of improving human performance. However, besides improving performance, it may often also be beneficial that the agent concurrently accounts for preserving the user's experience or satisfaction of collaboration. In order to address this additional goal, we examine approaches for improving the user experience by constraining the number of interventions by the autonomous agent. We propose two model-free reinforcement learning methods that can account for both hard and soft constraints on the number of interventions. We show that not only does our method outperform the existing baseline, but also eliminates the need to manually tune a black-box hyperparameter for controlling the level of assistance. We also provide an in-depth analysis of intervention scenarios in order to further illuminate system understanding.
    User-Level Label Leakage from Gradients in Federated Learning. (arXiv:2105.09369v4 [cs.CR] UPDATED)
    (0 min) Federated learning enables multiple users to build a joint model by sharing their model updates (gradients), while their raw data remains local on their devices. In contrast to the common belief that this provides privacy benefits, we here add to the very recent results on privacy risks when sharing gradients. Specifically, we investigate Label Leakage from Gradients (LLG), a novel attack to extract the labels of the users' training data from their shared gradients. The attack exploits the direction and magnitude of gradients to determine the presence or absence of any label. LLG is simple yet effective, capable of leaking potential sensitive information represented by labels, and scales well to arbitrary batch sizes and multiple classes. We mathematically and empirically demonstrate the validity of the attack under different settings. Moreover, empirical results show that LLG successfully extracts labels with high accuracy at the early stages of model training. We also discuss different defense mechanisms against such leakage. Our findings suggest that gradient compression is a practical technique to mitigate the attack.
    Provably Efficient Reinforcement Learning with Linear Function Approximation Under Adaptivity Constraints. (arXiv:2101.02195v2 [cs.LG] UPDATED)
    (0 min) We study reinforcement learning (RL) with linear function approximation under the adaptivity constraint. We consider two popular limited adaptivity models: the batch learning model and the rare policy switch model, and propose two efficient online RL algorithms for episodic linear Markov decision processes, where the transition probability and the reward function can be represented as a linear function of some known feature mapping. In specific, for the batch learning model, our proposed LSVI-UCB-Batch algorithm achieves an $\tilde O(\sqrt{d^3H^3T} + dHT/B)$ regret, where $d$ is the dimension of the feature mapping, $H$ is the episode length, $T$ is the number of interactions and $B$ is the number of batches. Our result suggests that it suffices to use only $\sqrt{T/dH}$ batches to obtain $\tilde O(\sqrt{d^3H^3T})$ regret. For the rare policy switch model, our proposed LSVI-UCB-RareSwitch algorithm enjoys an $\tilde O(\sqrt{d^3H^3T[1+T/(dH)]^{dH/B}})$ regret, which implies that $dH\log T$ policy switches suffice to obtain the $\tilde O(\sqrt{d^3H^3T})$ regret. Our algorithms achieve the same regret as the LSVI-UCB algorithm (Jin et al., 2019), yet with a substantially smaller amount of adaptivity. We also establish a lower bound for the batch learning model, which suggests that the dependency on $B$ in our regret bound is tight.
    BM-NAS: Bilevel Multimodal Neural Architecture Search. (arXiv:2104.09379v2 [cs.CV] UPDATED)
    (0 min) Deep neural networks (DNNs) have shown superior performances on various multimodal learning problems. However, it often requires huge efforts to adapt DNNs to individual multimodal tasks by manually engineering unimodal features and designing multimodal feature fusion strategies. This paper proposes Bilevel Multimodal Neural Architecture Search (BM-NAS) framework, which makes the architecture of multimodal fusion models fully searchable via a bilevel searching scheme. At the upper level, BM-NAS selects the inter/intra-modal feature pairs from the pretrained unimodal backbones. At the lower level, BM-NAS learns the fusion strategy for each feature pair, which is a combination of predefined primitive operations. The primitive operations are elaborately designed and they can be flexibly combined to accommodate various effective feature fusion modules such as multi-head attention (Transformer) and Attention on Attention (AoA). Experimental results on three multimodal tasks demonstrate the effectiveness and efficiency of the proposed BM-NAS framework. BM-NAS achieves competitive performances with much less search time and fewer model parameters in comparison with the existing generalized multimodal NAS methods.
    Nearly Minimax Optimal Reinforcement Learning for Discounted MDPs. (arXiv:2010.00587v3 [cs.LG] UPDATED)
    (0 min) We study the reinforcement learning problem for discounted Markov Decision Processes (MDPs) under the tabular setting. We propose a model-based algorithm named UCBVI-$\gamma$, which is based on the \emph{optimism in the face of uncertainty principle} and the Bernstein-type bonus. We show that UCBVI-$\gamma$ achieves an $\tilde{O}\big({\sqrt{SAT}}/{(1-\gamma)^{1.5}}\big)$ regret, where $S$ is the number of states, $A$ is the number of actions, $\gamma$ is the discount factor and $T$ is the number of steps. In addition, we construct a class of hard MDPs and show that for any algorithm, the expected regret is at least $\tilde{\Omega}\big({\sqrt{SAT}}/{(1-\gamma)^{1.5}}\big)$. Our upper bound matches the minimax lower bound up to logarithmic factors, which suggests that UCBVI-$\gamma$ is nearly minimax optimal for discounted MDPs.
    NCVX: A User-Friendly and Scalable Package for Nonconvex Optimization in Machine Learning. (arXiv:2111.13984v2 [cs.LG] UPDATED)
    (0 min) Optimizing nonconvex (NCVX) problems, especially nonsmooth and constrained ones, is an essential part of machine learning. However, it can be hard to reliably solve such problems without optimization expertise. Existing general-purpose NCVX optimization packages are powerful but typically cannot handle nonsmoothness. GRANSO is among the first optimization solvers targeting general nonsmooth NCVX problems with nonsmooth constraints, but, as it is implemented in MATLAB and requires the user to provide analytical gradients, GRANSO is often not a convenient choice in machine learning (especially deep learning) applications. To greatly lower the technical barrier, we introduce a new software package called NCVX, whose initial release contains the solver PyGRANSO, a PyTorch-enabled port of GRANSO incorporating auto-differentiation, GPU acceleration, tensor input, and support for new QP solvers. NCVX is built on freely available and widely used open-source frameworks, and as a highlight, can solve general constrained deep learning problems, the first of its kind. NCVX is available at https://ncvx.org, with detailed documentation and numerous examples from machine learning and other fields.
    Graph Pointer Neural Networks. (arXiv:2110.00973v2 [cs.LG] UPDATED)
    (0 min) Graph Neural Networks (GNNs) have shown advantages in various graph-based applications. Most existing GNNs assume strong homophily of graph structure and apply permutation-invariant local aggregation of neighbors to learn a representation for each node. However, they fail to generalize to heterophilic graphs, where most neighboring nodes have different labels or features, and the relevant nodes are distant. Few recent studies attempt to address this problem by combining multiple hops of hidden representations of central nodes (i.e., multi-hop-based approaches) or sorting the neighboring nodes based on attention scores (i.e., ranking-based approaches). As a result, these approaches have some apparent limitations. On the one hand, multi-hop-based approaches do not explicitly distinguish relevant nodes from a large number of multi-hop neighborhoods, leading to a severe over-smoothing problem. On the other hand, ranking-based models do not joint-optimize node ranking with end tasks and result in sub-optimal solutions. In this work, we present Graph Pointer Neural Networks (GPNN) to tackle the challenges mentioned above. We leverage a pointer network to select the most relevant nodes from a large amount of multi-hop neighborhoods, which constructs an ordered sequence according to the relationship with the central node. 1D convolution is then applied to extract high-level features from the node sequence. The pointer-network-based ranker in GPNN is joint-optimized with other parts in an end-to-end manner. Extensive experiments are conducted on six public node classification datasets with heterophilic graphs. The results show that GPNN significantly improves the classification performance of state-of-the-art methods. In addition, analyses also reveal the privilege of the proposed GPNN in filtering out irrelevant neighbors and reducing over-smoothing.
    A Survey of Generalisation in Deep Reinforcement Learning. (arXiv:2111.09794v3 [cs.LG] UPDATED)
    (0 min) The study of generalisation in deep Reinforcement Learning (RL) aims to produce RL algorithms whose policies generalise well to novel unseen situations at deployment time, avoiding overfitting to their training environments. Tackling this is vital if we are to deploy reinforcement learning algorithms in real world scenarios, where the environment will be diverse, dynamic and unpredictable. This survey is an overview of this nascent field. We provide a unifying formalism and terminology for discussing different generalisation problems, building upon previous works. We go on to categorise existing benchmarks for generalisation, as well as current methods for tackling the generalisation problem. Finally, we provide a critical discussion of the current state of the field, including recommendations for future work. Among other conclusions, we argue that taking a purely procedural content generation approach to benchmark design is not conducive to progress in generalisation, we suggest fast online adaptation and tackling RL-specific problems as some areas for future work on methods for generalisation, and we recommend building benchmarks in underexplored problem settings such as offline RL generalisation and reward-function variation.
    Sample Complexity of Learning Parametric Quantum Circuits. (arXiv:2107.09078v2 [quant-ph] UPDATED)
    (0 min) Quantum computers hold unprecedented potentials for machine learning applications. Here, we prove that physical quantum circuits are PAC (probably approximately correct) learnable on a quantum computer via empirical risk minimization: to learn a parametric quantum circuit with at most $n^c$ gates and each gate acting on a constant number of qubits, the sample complexity is bounded by $\tilde{O}(n^{c+1})$. In particular, we explicitly construct a family of variational quantum circuits with $O(n^{c+1})$ elementary gates arranged in a fixed pattern, which can represent all physical quantum circuits consisting of at most $n^c$ elementary gates. Our results provide a valuable guide for quantum machine learning in both theory and practice.
    High-dimensional Bayesian Optimization Algorithm with Recurrent Neural Network for Disease Control Models in Time Series. (arXiv:2201.00147v1 [cs.LG])
    (0 min) Bayesian Optimization algorithm has become a promising approach for nonlinear global optimization problems and many machine learning applications. Over the past few years, improvements and enhancements have been brought forward and they have shown some promising results in solving the complex dynamic problems, systems of ordinary differential equations where the objective functions are computationally expensive to evaluate. Besides, the straightforward implementation of the Bayesian Optimization algorithm performs well merely for optimization problems with 10-20 dimensions. The study presented in this paper proposes a new high dimensional Bayesian Optimization algorithm combining Recurrent neural networks, which is expected to predict the optimal solution for the global optimization problems with high dimensional or time series decision models. The proposed RNN-BO algorithm can solve the optimal control problems in the lower dimension space and then learn from the historical data using the recurrent neural network to learn the historical optimal solution data and predict the optimal control strategy for any new initial system value setting. In addition, accurately and quickly providing the optimal control strategy is essential to effectively and efficiently control the epidemic spread while minimizing the associated financial costs. Therefore, to verify the effectiveness of the proposed algorithm, computational experiments are carried out on a deterministic SEIR epidemic model and a stochastic SIS optimal control model. Finally, we also discuss the impacts of different numbers of the RNN layers and training epochs on the trade-off between solution quality and related computational efforts.
    FUSeg: The Foot Ulcer Segmentation Challenge. (arXiv:2201.00414v1 [eess.IV])
    (0 min) Acute and chronic wounds with varying etiologies burden the healthcare systems economically. The advanced wound care market is estimated to reach $22 billion by 2024. Wound care professionals provide proper diagnosis and treatment with heavy reliance on images and image documentation. Segmentation of wound boundaries in images is a key component of the care and diagnosis protocol since it is important to estimate the area of the wound and provide quantitative measurement for the treatment. Unfortunately, this process is very time-consuming and requires a high level of expertise. Recently automatic wound segmentation methods based on deep learning have shown promising performance but require large datasets for training and it is unclear which methods perform better. To address these issues, we propose the Foot Ulcer Segmentation challenge (FUSeg) organized in conjunction with the 2021 International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI). We built a wound image dataset containing 1,210 foot ulcer images collected over 2 years from 889 patients. It is pixel-wise annotated by wound care experts and split into a training set with 1010 images and a testing set with 200 images for evaluation. Teams around the world developed automated methods to predict wound segmentations on the testing set of which annotations were kept private. The predictions were evaluated and ranked based on the average Dice coefficient. The FUSeg challenge remains an open challenge as a benchmark for wound segmentation after the conference.
    Deep Representation Learning with an Information-theoretic Loss. (arXiv:2111.12950v3 [cs.LG] UPDATED)
    (0 min) This paper proposes an information-theoretic loss for learning deep neural networks. We propose a loss function based on the Information Bottleneck principle and a max-margin loss with an aim to increase the class separability in the embedded space. While deep neural network models have excelled in supervised learning tasks with large-scale labeled data available, they are prone to practical issues in testing samples outside of classes shown in training, e.g., anomaly detection and out-of-distribution detection. In such tasks, it is not sufficient to merely discriminate between known classes. Our intuition is to represent the known classes in compact and separated embedded regions in order to decrease the possibility of known and unseen classes largely overlapping in the embedded space. We show that the IB-based loss function reflects the inter-class distances as well as the compactness within classes, thus will extend the extending models of the existing deep data description models. Our empirical study shows that the proposed model improves the segmentation of normal classes in the deep feature space which contributes to identifying the out-of-distribution samples.
    Rank-1 Similarity Matrix Decomposition For Modeling Changes in Antivirus Consensus Through Time. (arXiv:2201.00757v1 [cs.CR])
    (0 min) Although groups of strongly correlated antivirus engines are known to exist, at present there is limited understanding of how or why these correlations came to be. Using a corpus of 25 million VirusTotal reports representing over a decade of antivirus scan data, we challenge prevailing wisdom that these correlations primarily originate from "first-order" interactions such as antivirus vendors copying the labels of leading vendors. We introduce the Temporal Rank-1 Similarity Matrix decomposition (R1SM-T) in order to investigate the origins of these correlations and to model how consensus amongst antivirus engines changes over time. We reveal that first-order interactions do not explain as much behavior in antivirus correlation as previously thought, and that the relationships between antivirus engines are highly volatile. We make recommendations on items in need of future study and consideration based on our findings.
    Feature matching as improved transfer learning technique for wearable EEG. (arXiv:2201.00644v1 [eess.SP])
    (0 min) Objective: With the rapid rise of wearable sleep monitoring devices with non-conventional electrode configurations, there is a need for automated algorithms that can perform sleep staging on configurations with small amounts of labeled data. Transfer learning has the ability to adapt neural network weights from a source modality (e.g. standard electrode configuration) to a new target modality (e.g. non-conventional electrode configuration). Methods: We propose feature matching, a new transfer learning strategy as an alternative to the commonly used finetuning approach. This method consists of training a model with larger amounts of data from the source modality and few paired samples of source and target modality. For those paired samples, the model extracts features of the target modality, matching these to the features from the corresponding samples of the source modality. Results: We compare feature matching to finetuning for three different target domains, with two different neural network architectures, and with varying amounts of training data. Particularly on small cohorts (i.e. 2 - 5 labeled recordings in the non-conventional recording setting), feature matching systematically outperforms finetuning with mean relative differences in accuracy ranging from 0.4% to 4.7% for the different scenarios and datasets. Conclusion: Our findings suggest that feature matching outperforms finetuning as a transfer learning approach, especially in very low data regimes. Significance: As such, we conclude that feature matching is a promising new method for wearable sleep staging with novel devices.
    Deep Learning Applications for Lung Cancer Diagnosis: A systematic review. (arXiv:2201.00227v1 [eess.IV])
    (0 min) Lung cancer has been one of the most prevalent disease in recent years. According to the research of this field, more than 200,000 cases are identified each year in the US. Uncontrolled multiplication and growth of the lung cells result in malignant tumour formation. Recently, deep learning algorithms, especially Convolutional Neural Networks (CNN), have become a superior way to automatically diagnose disease. The purpose of this article is to review different models that lead to different accuracy and sensitivity in the diagnosis of early-stage lung cancer and to help physicians and researchers in this field. The main purpose of this work is to identify the challenges that exist in lung cancer based on deep learning. The survey is systematically written that combines regular mapping and literature review to review 32 conference and journal articles in the field from 2016 to 2021. After analysing and reviewing the articles, the questions raised in the articles are being answered. This research is superior to other review articles in this field due to the complete review of relevant articles and systematic write up.
    Generalization Bounds for Noisy Iterative Algorithms Using Properties of Additive Noise Channels. (arXiv:2102.02976v3 [stat.ML] UPDATED)
    (0 min) Machine learning models trained by different optimization algorithms under different data distributions can exhibit distinct generalization behaviors. In this paper, we analyze the generalization of models trained by noisy iterative algorithms. We derive distribution-dependent generalization bounds by connecting noisy iterative algorithms to additive noise channels found in communication and information theory. Our generalization bounds shed light on several applications, including differentially private stochastic gradient descent (DP-SGD), federated learning, and stochastic gradient Langevin dynamics (SGLD). We demonstrate our bounds through numerical experiments, showing that they can help understand recent empirical observations of the generalization phenomena of neural networks.
    Provable Meta-Learning of Linear Representations. (arXiv:2002.11684v5 [cs.LG] UPDATED)
    (0 min) Meta-learning, or learning-to-learn, seeks to design algorithms that can utilize previous experience to rapidly learn new skills or adapt to new environments. Representation learning -- a key tool for performing meta-learning -- learns a data representation that can transfer knowledge across multiple tasks, which is essential in regimes where data is scarce. Despite a recent surge of interest in the practice of meta-learning, the theoretical underpinnings of meta-learning algorithms are lacking, especially in the context of learning transferable representations. In this paper, we focus on the problem of multi-task linear regression -- in which multiple linear regression models share a common, low-dimensional linear representation. Here, we provide provably fast, sample-efficient algorithms to address the dual challenges of (1) learning a common set of features from multiple, related tasks, and (2) transferring this knowledge to new, unseen tasks. Both are central to the general problem of meta-learning. Finally, we complement these results by providing information-theoretic lower bounds on the sample complexity of learning these linear features.
    Audiogmenter: a MATLAB Toolbox for Audio Data Augmentation. (arXiv:1912.05472v4 [eess.AS] UPDATED)
    (0 min) Audio data augmentation is a key step in training deep neural networks for solving audio classification tasks. In this paper, we introduce Audiogmenter, a novel audio data augmentation library in MATLAB. We provide 15 different augmentation algorithms for raw audio data and 8 for spectrograms. We efficiently implemented several augmentation techniques whose usefulness has been extensively proved in the literature. To the best of our knowledge, this is the largest MATLAB audio data augmentation library freely available. We validate the efficiency of our algorithms evaluating them on the ESC-50 dataset. The toolbox and its documentation can be downloaded at https://github.com/LorisNanni/Audiogmenter.
    VDPC: Variational Density Peak Clustering Algorithm. (arXiv:2201.00641v1 [cs.LG])
    (0 min) The widely applied density peak clustering (DPC) algorithm makes an intuitive cluster formation assumption that cluster centers are often surrounded by data points with lower local density and far away from other data points with higher local density. However, this assumption suffers from one limitation that it is often problematic when identifying clusters with lower density because they might be easily merged into other clusters with higher density. As a result, DPC may not be able to identify clusters with variational density. To address this issue, we propose a variational density peak clustering (VDPC) algorithm, which is designed to systematically and autonomously perform the clustering task on datasets with various types of density distributions. Specifically, we first propose a novel method to identify the representatives among all data points and construct initial clusters based on the identified representatives for further analysis of the clusters' property. Furthermore, we divide all data points into different levels according to their local density and propose a unified clustering framework by combining the advantages of both DPC and DBSCAN. Thus, all the identified initial clusters spreading across different density levels are systematically processed to form the final clusters. To evaluate the effectiveness of the proposed VDPC algorithm, we conduct extensive experiments using 20 datasets including eight synthetic, six real-world and six image datasets. The experimental results show that VDPC outperforms two classical algorithms (i.e., DPC and DBSCAN) and four state-of-the-art extended DPC algorithms.
    SLURP: Side Learning Uncertainty for Regression Problems. (arXiv:2110.11182v2 [cs.CV] UPDATED)
    (0 min) It has become critical for deep learning algorithms to quantify their output uncertainties to satisfy reliability constraints and provide accurate results. Uncertainty estimation for regression has received less attention than classification due to the more straightforward standardized output of the latter class of tasks and their high importance. However, regression problems are encountered in a wide range of applications in computer vision. We propose SLURP, a generic approach for regression uncertainty estimation via a side learner that exploits the output and the intermediate representations generated by the main task model. We test SLURP on two critical regression tasks in computer vision: monocular depth and optical flow estimation. In addition, we conduct exhaustive benchmarks comprising transfer to different datasets and the addition of aleatoric noise. The results show that our proposal is generic and readily applicable to various regression problems and has a low computational cost with respect to existing solutions.
    Continuous Submodular Maximization: Boosting via Non-oblivious Function. (arXiv:2201.00703v1 [cs.LG])
    (0 min) In this paper, we revisit the constrained and stochastic continuous submodular maximization in both offline and online settings. For each $\gamma$-weakly DR-submodular function $f$, we use the factor-revealing optimization equation to derive an optimal auxiliary function $F$, whose stationary points provide a $(1-e^{-\gamma})$-approximation to the global maximum value (denoted as $OPT$) of problem $\max_{\boldsymbol{x}\in\mathcal{C}}f(\boldsymbol{x})$. Naturally, the projected (mirror) gradient ascent relied on this non-oblivious function achieves $(1-e^{-\gamma}-\epsilon^{2})OPT-\epsilon$ after $O(1/\epsilon^{2})$ iterations, beating the traditional $(\frac{\gamma^{2}}{1+\gamma^{2}})$-approximation gradient ascent \citep{hassani2017gradient} for submodular maximization. Similarly, based on $F$, the classical Frank-Wolfe algorithm equipped with variance reduction technique \citep{mokhtari2018conditional} also returns a solution with objective value larger than $(1-e^{-\gamma}-\epsilon^{2})OPT-\epsilon$ after $O(1/\epsilon^{3})$ iterations. In the online setting, we first consider the adversarial delays for stochastic gradient feedback, under which we propose a boosting online gradient algorithm with the same non-oblivious search, achieving a regret of $\sqrt{D}$ (where $D$ is the sum of delays of gradient feedback) against a $(1-e^{-\gamma})$-approximation to the best feasible solution in hindsight. Finally, extensive numerical experiments demonstrate the efficiency of our boosting methods.
    Vision Transformer Slimming: Multi-Dimension Searching in Continuous Optimization Space. (arXiv:2201.00814v1 [cs.CV])
    (0 min) This paper explores the feasibility of finding an optimal sub-model from a vision transformer and introduces a pure vision transformer slimming (ViT-Slim) framework that can search such a sub-structure from the original model end-to-end across multiple dimensions, including the input tokens, MHSA and MLP modules with state-of-the-art performance. Our method is based on a learnable and unified l1 sparsity constraint with pre-defined factors to reflect the global importance in the continuous searching space of different dimensions. The searching process is highly efficient through a single-shot training scheme. For instance, on DeiT-S, ViT-Slim only takes ~43 GPU hours for searching process, and the searched structure is flexible with diverse dimensionalities in different modules. Then, a budget threshold is employed according to the requirements of accuracy-FLOPs trade-off on running devices, and a re-training process is performed to obtain the final models. The extensive experiments show that our ViT-Slim can compress up to 40% of parameters and 40% FLOPs on various vision transformers while increasing the accuracy by ~0.6% on ImageNet. We also demonstrate the advantage of our searched models on several downstream datasets. Our source code will be publicly available.
    Detecting Misclassification Errors in Neural Networks with a Gaussian Process Model. (arXiv:2010.02065v4 [cs.LG] UPDATED)
    (0 min) As neural network classifiers are deployed in real-world applications, it is crucial that their failures can be detected reliably. One practical solution is to assign confidence scores to each prediction, then use these scores to filter out possible misclassifications. However, existing confidence metrics are not yet sufficiently reliable for this role. This paper presents a new framework that produces a quantitative metric for detecting misclassification errors. This framework, RED, builds an error detector on top of the base classifier and estimates uncertainty of the detection scores using Gaussian Processes. Experimental comparisons with other error detection methods on 125 UCI datasets demonstrate that this approach is effective. Further implementations on two probabilistic base classifiers and two large deep learning architecture in vision tasks further confirm that the method is robust and scalable. Third, an empirical analysis of RED with out-of-distribution and adversarial samples shows that the method can be used not only to detect errors but also to understand where they come from. RED can thereby be used to improve trustworthiness of neural network classifiers more broadly in the future.
    Machine learning moment closure models for the radiative transfer equation I: directly learning a gradient based closure. (arXiv:2105.05690v2 [math.NA] UPDATED)
    (0 min) In this paper, we take a data-driven approach and apply machine learning to the moment closure problem for radiative transfer equation in slab geometry. Instead of learning the unclosed high order moment, we propose to directly learn the gradient of the high order moment using neural networks. This new approach is consistent with the exact closure we derive for the free streaming limit and also provides a natural output normalization. A variety of benchmark tests, including the variable scattering problem, the Gaussian source problem with both periodic and reflecting boundaries, and the two-material problem, show both good accuracy and generalizability of our machine learning closure model.
    Estimating Causal Effects of Multi-Aspect Online Reviews with Multi-Modal Proxies. (arXiv:2112.10274v2 [cs.LG] UPDATED)
    (0 min) Online reviews enable consumers to engage with companies and provide important feedback. Due to the complexity of the high-dimensional text, these reviews are often simplified as a single numerical score, e.g., ratings or sentiment scores. This work empirically examines the causal effects of user-generated online reviews on a granular level: we consider multiple aspects, e.g., the Food and Service of a restaurant. Understanding consumers' opinions toward different aspects can help evaluate business performance in detail and strategize business operations effectively. Specifically, we aim to answer interventional questions such as What will the restaurant popularity be if the quality w.r.t. its aspect Service is increased by 10%? The defining challenge of causal inference with observational data is the presence of "confounder", which might not be observed or measured, e.g., consumers' preference to food type, rendering the estimated effects biased and high-variance. To address this challenge, we have recourse to the multi-modal proxies such as the consumer profile information and interactions between consumers and businesses. We show how to effectively leverage the rich information to identify and estimate causal effects of multiple aspects embedded in online reviews. Empirical evaluations on synthetic and real-world data corroborate the efficacy and shed light on the actionable insight of the proposed approach.
    An Attention Score Based Attacker for Black-box NLP Classifier. (arXiv:2112.11660v2 [cs.LG] UPDATED)
    (0 min) Deep neural networks have a wide range of applications in solving various real-world tasks and have achieved satisfactory results, in domains such as computer vision, image classification, and natural language processing. Meanwhile, the security and robustness of neural networks have become imperative, as diverse researches have shown the vulnerable aspects of neural networks. Case in point, in Natural language processing tasks, the neural network may be fooled by an attentively modified text, which has a high similarity to the original one. As per previous research, most of the studies are focused on the image domain; Different from image adversarial attacks, the text is represented in a discrete sequence, traditional image attack methods are not applicable in the NLP field. In this paper, we propose a word-level NLP sentiment classifier attack model, which includes a self-attention mechanism-based word selection method and a greedy search algorithm for word substitution. We experiment with our attack model by attacking GRU and 1D-CNN victim models on IMDB datasets. Experimental results demonstrate that our model achieves a higher attack success rate and more efficient than previous methods due to the efficient word selection algorithms are employed and minimized the word substitute number. Also, our model is transferable, which can be used in the image domain with several modifications.
    Execute Order 66: Targeted Data Poisoning for Reinforcement Learning. (arXiv:2201.00762v1 [cs.LG])
    (0 min) Data poisoning for reinforcement learning has historically focused on general performance degradation, and targeted attacks have been successful via perturbations that involve control of the victim's policy and rewards. We introduce an insidious poisoning attack for reinforcement learning which causes agent misbehavior only at specific target states - all while minimally modifying a small fraction of training observations without assuming any control over policy or reward. We accomplish this by adapting a recent technique, gradient alignment, to reinforcement learning. We test our method and demonstrate success in two Atari games of varying difficulty.
    Online Regularization towards Always-Valid High-Dimensional Dynamic Pricing. (arXiv:2007.02470v2 [stat.ML] UPDATED)
    (0 min) Devising dynamic pricing policy with always valid online statistical learning procedure is an important and as yet unresolved problem. Most existing dynamic pricing policy, which focus on the faithfulness of adopted customer choice models, exhibit a limited capability for adapting the online uncertainty of learned statistical model during pricing process. In this paper, we propose a novel approach for designing dynamic pricing policy based regularized online statistical learning with theoretical guarantees. The new approach overcomes the challenge of continuous monitoring of online Lasso procedure and possesses several appealing properties. In particular, we make the decisive observation that the always-validity of pricing decisions builds and thrives on the online regularization scheme. Our proposed online regularization scheme equips the proposed optimistic online regularized maximum likelihood pricing (OORMLP) pricing policy with three major advantages: encode market noise knowledge into pricing process optimism; empower online statistical learning with always-validity over all decision points; envelop prediction error process with time-uniform non-asymptotic oracle inequalities. This type of non-asymptotic inference results allows us to design more sample-efficient and robust dynamic pricing algorithms in practice. In theory, the proposed OORMLP algorithm exploits the sparsity structure of high-dimensional models and secures a logarithmic regret in a decision horizon. These theoretical advances are made possible by proposing an optimistic online Lasso procedure that resolves dynamic pricing problems at the process level, based on a novel use of non-asymptotic martingale concentration. In experiments, we evaluate OORMLP in different synthetic and real pricing problem settings, and demonstrate that OORMLP advances the state-of-the-art methods.
    Randomized Signature Layers for Signal Extraction in Time Series Data. (arXiv:2201.00384v1 [cs.LG])
    (0 min) Time series analysis is a widespread task in Natural Sciences, Social Sciences, and Engineering. A fundamental problem is finding an expressive yet efficient-to-compute representation of the input time series to use as a starting point to perform arbitrary downstream tasks. In this paper, we build upon recent works that use the Signature of a path as a feature map and investigate a computationally efficient technique to approximate these features based on linear random projections. We present several theoretical results to justify our approach and empirically validate that our random projections can effectively retrieve the underlying Signature of a path. We show the surprising performance of the proposed random features on several tasks, including (1) mapping the controls of stochastic differential equations to the corresponding solutions and (2) using the Randomized Signatures as time series representation for classification tasks. When compared to corresponding truncated Signature approaches, our Randomizes Signatures are more computationally efficient in high dimensions and often lead to better accuracy and faster training. Besides providing a new tool to extract Signatures and further validating the high level of expressiveness of such features, we believe our results provide interesting conceptual links between several existing research areas, suggesting new intriguing directions for future investigations.
    Neural combinatorial optimization beyond the TSP: Existing architectures under-represent graph structure. (arXiv:2201.00668v1 [cs.AI])
    (0 min) Recent years have witnessed the promise that reinforcement learning, coupled with Graph Neural Network (GNN) architectures, could learn to solve hard combinatorial optimization problems: given raw input data and an evaluator to guide the process, the idea is to automatically learn a policy able to return feasible and high-quality outputs. Recent work have shown promising results but the latter were mainly evaluated on the travelling salesman problem (TSP) and similar abstract variants such as Split Delivery Vehicle Routing Problem (SDVRP). In this paper, we analyze how and whether recent neural architectures can be applied to graph problems of practical importance. We thus set out to systematically "transfer" these architectures to the Power and Channel Allocation Problem (PCAP), which has practical relevance for, e.g., radio resource allocation in wireless networks. Our experimental results suggest that existing architectures (i) are still incapable of capturing graph structural features and (ii) are not suitable for problems where the actions on the graph change the graph attributes. On a positive note, we show that augmenting the structural representation of problems with Distance Encoding is a promising step towards the still-ambitious goal of learning multi-purpose autonomous solvers.
    Momentum Contrastive Voxel-wise Representation Learning for Semi-supervised Volumetric Medical Image Segmentation. (arXiv:2105.07059v3 [cs.CV] UPDATED)
    (0 min) Automated segmentation in medical image analysis is a challenging task that requires a large amount of manually labeled data. However, manually annotating medical data is often laborious, and most existing learning-based approaches fail to accurately delineate object boundaries without effective geometric constraints. Contrastive learning, a sub-area of self-supervised learning, has recently been noted as a promising direction in multiple application fields. In this work, we present a novel Contrastive Voxel-wise Representation Distillation (CVRD) method with geometric constraints to learn global-local visual representations for volumetric medical image segmentation with limited annotations. Our framework can effectively learn global and local features by capturing 3D spatial context and rich anatomical information. Specifically, we introduce a voxel-to-volume contrastive algorithm to learn global information from 3D images, and propose to perform local voxel-to-voxel distillation to explicitly make use of local cues in the embedding space. Moreover, we integrate an elastic interaction-based active contour model as a geometric regularization term to enable fast and reliable object delineations in an end-to-end learning manner. Results on the Atrial Segmentation Challenge dataset demonstrate superiority of our proposed scheme, especially in a setting with a very limited number of annotated data. The code will be available at https://github.com/charlesyou999648/CVRD.
    Robust Contrastive Learning Using Negative Samples with Diminished Semantics. (arXiv:2110.14189v2 [cs.CV] UPDATED)
    (0 min) Unsupervised learning has recently made exceptional progress because of the development of more effective contrastive learning methods. However, CNNs are prone to depend on low-level features that humans deem non-semantic. This dependency has been conjectured to induce a lack of robustness to image perturbations or domain shift. In this paper, we show that by generating carefully designed negative samples, contrastive learning can learn more robust representations with less dependence on such features. Contrastive learning utilizes positive pairs that preserve semantic information while perturbing superficial features in the training images. Similarly, we propose to generate negative samples in a reversed way, where only the superfluous instead of the semantic features are preserved. We develop two methods, texture-based and patch-based augmentations, to generate negative samples. These samples achieve better generalization, especially under out-of-domain settings. We also analyze our method and the generated texture-based samples, showing that texture features are indispensable in classifying particular ImageNet classes and especially finer classes. We also show that model bias favors texture and shape features differently under different test settings. Our code, trained models, and ImageNet-Texture dataset can be found at https://github.com/SongweiGe/Contrastive-Learning-with-Non-Semantic-Negatives.
    Snowflake: Scaling GNNs to High-Dimensional Continuous Control via Parameter Freezing. (arXiv:2103.01009v3 [cs.LG] UPDATED)
    (0 min) Recent research has shown that graph neural networks (GNNs) can learn policies for locomotion control that are as effective as a typical multi-layer perceptron (MLP), with superior transfer and multi-task performance (Wang et al., 2018; Huang et al., 2020). Results have so far been limited to training on small agents, with the performance of GNNs deteriorating rapidly as the number of sensors and actuators grows. A key motivation for the use of GNNs in the supervised learning setting is their applicability to large graphs, but this benefit has not yet been realised for locomotion control. We identify the weakness with a common GNN architecture that causes this poor scaling: overfitting in the MLPs within the network that encode, decode, and propagate messages. To combat this, we introduce Snowflake, a GNN training method for high-dimensional continuous control that freezes parameters in parts of the network that suffer from overfitting. Snowflake significantly boosts the performance of GNNs for locomotion control on large agents, now matching the performance of MLPs, and with superior transfer properties.
    Rethinking the Implementation Tricks and Monotonicity Constraint in Cooperative Multi-Agent Reinforcement Learning. (arXiv:2102.03479v18 [cs.LG] UPDATED)
    (0 min) Many complex multi-agent systems such as robot swarms control and autonomous vehicle coordination can be modeled as Multi-Agent Reinforcement Learning (MARL) tasks. QMIX, a widely popular MARL algorithm, has been used as a baseline for the benchmark environments, e.g., Starcraft Multi-Agent Challenge (SMAC), Difficulty-Enhanced Predator-Prey (DEPP). Recent variants of QMIX target relaxing the monotonicity constraint of QMIX, allowing for performance improvement in SMAC. In this paper, we investigate the code-level optimizations of these variants and the monotonicity constraint. (1) We find that such improvements of the variants are significantly affected by various code-level optimizations. (2) The experiment results show that QMIX with normalized optimizations outperforms other works in SMAC; (3) beyond the common wisdom from these works, the monotonicity constraint can improve sample efficiency in SMAC and DEPP. We also discuss why monotonicity constraints work well in purely cooperative tasks with a theoretical analysis. We open-source the code at \url{https://github.com/hijkzzz/pymarl2}.
    A Cluster-Based Trip Prediction Graph Neural Network Model for Bike Sharing Systems. (arXiv:2201.00720v1 [cs.LG])
    (0 min) Bike Sharing Systems (BSSs) are emerging as an innovative transportation service. Ensuring the proper functioning of a BSS is crucial given that these systems are committed to eradicating many of the current global concerns, by promoting environmental and economic sustainability and contributing to improving the life quality of the population. Good knowledge of users' transition patterns is a decisive contribution to the quality and operability of the service. The analogous and unbalanced users' transition patterns cause these systems to suffer from bicycle imbalance, leading to a drastic customer loss in the long term. Strategies for bicycle rebalancing become important to tackle this problem and for this, bicycle traffic prediction is essential, as it allows to operate more efficiently and to react in advance. In this work, we propose a bicycle trips predictor based on Graph Neural Network embeddings, taking into consideration station groupings, meteorology conditions, geographical distances, and trip patterns. We evaluated our approach in the New York City BSS (CitiBike) data and compared it with four baselines, including the non-clustered approach. To address our problem's specificities, we developed the Adaptive Transition Constraint Clustering Plus (AdaTC+) algorithm, eliminating shortcomings of previous work. Our experiments evidence the clustering pertinence (88% accuracy compared with 83% without clustering) and which clustering technique best suits this problem. Accuracy on the Link Prediction task is always higher for AdaTC+ than benchmark clustering methods when the stations are the same, while not degrading performance when the network is upgraded, in a mismatch with the trained model.
    AutoDES: AutoML Pipeline Generation of Classification with Dynamic Ensemble Strategy Selection. (arXiv:2201.00207v1 [cs.LG])
    (0 min) Automating machine learning has achieved remarkable technological developments in recent years, and building an automated machine learning pipeline is now an essential task. The model ensemble is the technique of combining multiple models to get a better and more robust model. However, existing automated machine learning tends to be simplistic in handling the model ensemble, where the ensemble strategy is fixed, such as stacked generalization. There have been many techniques on different ensemble methods, especially ensemble selection, and the fixed ensemble strategy limits the upper limit of the model's performance. In this article, we present a novel framework for automated machine learning. Our framework incorporates advances in dynamic ensemble selection, and to our best knowledge, our approach is the first in the field of AutoML to search and optimize ensemble strategies. In the comparison experiments, our method outperforms the state-of-the-art automated machine learning frameworks with the same CPU time in 42 classification datasets from the OpenML platform. Ablation experiments on our framework validate the effectiveness of our proposed method.
    Characterizing Speech Adversarial Examples Using Self-Attention U-Net Enhancement. (arXiv:2003.13917v2 [eess.AS] UPDATED)
    (0 min) Recent studies have highlighted adversarial examples as ubiquitous threats to the deep neural network (DNN) based speech recognition systems. In this work, we present a U-Net based attention model, U-Net$_{At}$, to enhance adversarial speech signals. Specifically, we evaluate the model performance by interpretable speech recognition metrics and discuss the model performance by the augmented adversarial training. Our experiments show that our proposed U-Net$_{At}$ improves the perceptual evaluation of speech quality (PESQ) from 1.13 to 2.78, speech transmission index (STI) from 0.65 to 0.75, short-term objective intelligibility (STOI) from 0.83 to 0.96 on the task of speech enhancement with adversarial speech examples. We conduct experiments on the automatic speech recognition (ASR) task with adversarial audio attacks. We find that (i) temporal features learned by the attention network are capable of enhancing the robustness of DNN based ASR models; (ii) the generalization power of DNN based ASR model could be enhanced by applying adversarial training with an additive adversarial data augmentation. The ASR metric on word-error-rates (WERs) shows that there is an absolute 2.22 $\%$ decrease under gradient-based perturbation, and an absolute 2.03 $\%$ decrease, under evolutionary-optimized perturbation, which suggests that our enhancement models with adversarial training can further secure a resilient ASR system.
    A Near-Optimal Algorithm for Debiasing Trained Machine Learning Models. (arXiv:2106.12887v2 [cs.LG] UPDATED)
    (0 min) We present a scalable post-processing algorithm for debiasing trained models, including deep neural networks (DNNs), which we prove to be near-optimal by bounding its excess Bayes risk. We empirically validate its advantages on standard benchmark datasets across both classical algorithms as well as modern DNN architectures and demonstrate that it outperforms previous post-processing methods while performing on par with in-processing. In addition, we show that the proposed algorithm is particularly effective for models trained at scale where post-processing is a natural and practical choice.
    DDPG car-following model with real-world human driving experience in CARLA. (arXiv:2112.14602v1 [cs.RO] CROSS LISTED)
    (0 min) In the autonomous driving field, the fusion of human knowledge into Deep Reinforcement Learning (DRL) is often based on the human demonstration recorded in the simulated environment. This limits the generalization and the feasibility of application in real-world traffic. We proposed a two-stage DRL method, that learns from real-world human driving to achieve performance that is superior to the pure DRL agent. Training a DRL agent is done within a framework for CARLA with Robot Operating System (ROS). For evaluation, we designed different real-world driving scenarios to compare the proposed two-stage DRL agent with the pure DRL agent. After extracting the 'good' behavior from the human driver, such as anticipation in a signalized intersection, the agent becomes more efficient and drives safer, which makes this autonomous agent more adapt to Human-Robot Interaction (HRI) traffic.
    An Embedding of ReLU Networks and an Analysis of their Identifiability. (arXiv:2107.09370v2 [cs.LG] UPDATED)
    (0 min) Neural networks with the Rectified Linear Unit (ReLU) nonlinearity are described by a vector of parameters $\theta$, and realized as a piecewise linear continuous function $R_{\theta}: x \in \mathbb R^{d} \mapsto R_{\theta}(x) \in \mathbb R^{k}$. Natural scalings and permutations operations on the parameters $\theta$ leave the realization unchanged, leading to equivalence classes of parameters that yield the same realization. These considerations in turn lead to the notion of identifiability -- the ability to recover (the equivalence class of) $\theta$ from the sole knowledge of its realization $R_{\theta}$. The overall objective of this paper is to introduce an embedding for ReLU neural networks of any depth, $\Phi(\theta)$, that is invariant to scalings and that provides a locally linear parameterization of the realization of the network. Leveraging these two key properties, we derive some conditions under which a deep ReLU network is indeed locally identifiable from the knowledge of the realization on a finite set of samples $x_{i} \in \mathbb R^{d}$. We study the shallow case in more depth, establishing necessary and sufficient conditions for the network to be identifiable from a bounded subset $\mathcal X \subseteq \mathbb R^{d}$.
    The Flip Side of the Reweighted Coin: Duality of Adaptive Dropout and Regularization. (arXiv:2106.07769v3 [cs.LG] UPDATED)
    (0 min) Among the most successful methods for sparsifying deep (neural) networks are those that adaptively mask the network weights throughout training. By examining this masking, or dropout, in the linear case, we uncover a duality between such adaptive methods and regularization through the so-called "$\eta$-trick" that casts both as iteratively reweighted optimizations. We show that any dropout strategy that adapts to the weights in a monotonic way corresponds to an effective subquadratic regularization penalty, and therefore leads to sparse solutions. We obtain the effective penalties for several popular sparsification strategies, which are remarkably similar to classical penalties commonly used in sparse optimization. Considering variational dropout as a case study, we demonstrate similar empirical behavior between the adaptive dropout method and classical methods on the task of deep network sparsification, validating our theory.
    AHAR: Adaptive CNN for Energy-efficient Human Activity Recognition in Low-power Edge Devices. (arXiv:2102.01875v3 [cs.LG] UPDATED)
    (0 min) Human Activity Recognition (HAR) is one of the key applications of health monitoring that requires continuous use of wearable devices to track daily activities. This paper proposes an Adaptive CNN for energy-efficient HAR (AHAR) suitable for low-power edge devices. Unlike traditional early exit architecture that makes the exit decision based on classification confidence, AHAR proposes a novel adaptive architecture that uses an output block predictor to select a portion of the baseline architecture to use during the inference phase. Experimental results show that traditional early exit architectures suffer from performance loss whereas our adaptive architecture provides similar or better performance as the baseline one while being energy-efficient. We validate our methodology in classifying locomotion activities from two datasets- Opportunity and w-HAR. Compared to the fog/cloud computing approaches for the Opportunity dataset, our baseline and adaptive architecture shows a comparable weighted F1 score of 91.79%, and 91.57%, respectively. For the w-HAR dataset, our baseline and adaptive architecture outperforms the state-of-the-art works with a weighted F1 score of 97.55%, and 97.64%, respectively. Evaluation on real hardware shows that our baseline architecture is significantly energy-efficient (422.38x less) and memory-efficient (14.29x less) compared to the works on the Opportunity dataset. For the w-HAR dataset, our baseline architecture requires 2.04x less energy and 2.18x less memory compared to the state-of-the-art work. Moreover, experimental results show that our adaptive architecture is 12.32% (Opportunity) and 11.14% (w-HAR) energy-efficient than our baseline while providing similar (Opportunity) or better (w-HAR) performance with no significant memory overhead.
    DeepSight: Mitigating Backdoor Attacks in Federated Learning Through Deep Model Inspection. (arXiv:2201.00763v1 [cs.CR])
    (0 min) Federated Learning (FL) allows multiple clients to collaboratively train a Neural Network (NN) model on their private data without revealing the data. Recently, several targeted poisoning attacks against FL have been introduced. These attacks inject a backdoor into the resulting model that allows adversary-controlled inputs to be misclassified. Existing countermeasures against backdoor attacks are inefficient and often merely aim to exclude deviating models from the aggregation. However, this approach also removes benign models of clients with deviating data distributions, causing the aggregated model to perform poorly for such clients. To address this problem, we propose DeepSight, a novel model filtering approach for mitigating backdoor attacks. It is based on three novel techniques that allow to characterize the distribution of data used to train model updates and seek to measure fine-grained differences in the internal structure and outputs of NNs. Using these techniques, DeepSight can identify suspicious model updates. We also develop a scheme that can accurately cluster model updates. Combining the results of both components, DeepSight is able to identify and eliminate model clusters containing poisoned models with high attack impact. We also show that the backdoor contributions of possibly undetected poisoned models can be effectively mitigated with existing weight clipping-based defenses. We evaluate the performance and effectiveness of DeepSight and show that it can mitigate state-of-the-art backdoor attacks with a negligible impact on the model's performance on benign data.
    Class-Incremental Continual Learning into the eXtended DER-verse. (arXiv:2201.00766v1 [cs.LG])
    (0 min) The staple of human intelligence is the capability of acquiring knowledge in a continuous fashion. In stark contrast, Deep Networks forget catastrophically and, for this reason, the sub-field of Class-Incremental Continual Learning fosters methods that learn a sequence of tasks incrementally, blending sequentially-gained knowledge into a comprehensive prediction. This work aims at assessing and overcoming the pitfalls of our previous proposal Dark Experience Replay (DER), a simple and effective approach that combines rehearsal and Knowledge Distillation. Inspired by the way our minds constantly rewrite past recollections and set expectations for the future, we endow our model with the abilities to i) revise its replay memory to welcome novel information regarding past data ii) pave the way for learning yet unseen classes. We show that the application of these strategies leads to remarkable improvements; indeed, the resulting method - termed eXtended-DER (X-DER) - outperforms the state of the art on both standard benchmarks (such as CIFAR-100 and miniImagenet) and a novel one here introduced. To gain a better understanding, we further provide extensive ablation studies that corroborate and extend the findings of our previous research (e.g. the value of Knowledge Distillation and flatter minima in continual learning setups).
    Optimization-Based Algebraic Multigrid Coarsening Using Reinforcement Learning. (arXiv:2106.01854v2 [cs.LG] UPDATED)
    (0 min) Large sparse linear systems of equations are ubiquitous in science and engineering, such as those arising from discretizations of partial differential equations. Algebraic multigrid (AMG) methods are one of the most common methods of solving such linear systems, with an extensive body of underlying mathematical theory. A system of linear equations defines a graph on the set of unknowns and each level of a multigrid solver requires the selection of an appropriate coarse graph along with restriction and interpolation operators that map to and from the coarse representation. The efficiency of the multigrid solver depends critically on this selection and many selection methods have been developed over the years. Recently, it has been demonstrated that it is possible to directly learn the AMG interpolation and restriction operators, given a coarse graph selection. In this paper, we consider the complementary problem of learning to coarsen graphs for a multigrid solver, a necessary step in developing fully learnable AMG methods. We propose a method using a reinforcement learning (RL) agent based on graph neural networks (GNNs), which can learn to perform graph coarsening on small planar training graphs and then be applied to unstructured large planar graphs, assuming bounded node degree. We demonstrate that this method can produce better coarse graphs than existing algorithms, even as the graph size increases and other properties of the graph are varied. We also propose an efficient inference procedure for performing graph coarsening that results in linear time complexity in graph size.
    A Comprehensive Survey on Radio Frequency (RF) Fingerprinting: Traditional Approaches, Deep Learning, and Open Challenges. (arXiv:2201.00680v1 [cs.LG])
    (0 min) Fifth generation (5G) networks and beyond envisions massive Internet of Things (IoT) rollout to support disruptive applications such as extended reality (XR), augmented/virtual reality (AR/VR), industrial automation, autonomous driving, and smart everything which brings together massive and diverse IoT devices occupying the radio frequency (RF) spectrum. Along with spectrum crunch and throughput challenges, such a massive scale of wireless devices exposes unprecedented threat surfaces. RF fingerprinting is heralded as a candidate technology that can be combined with cryptographic and zero-trust security measures to ensure data privacy, confidentiality, and integrity in wireless networks. Motivated by the relevance of this subject in the future communication networks, in this work, we present a comprehensive survey of RF fingerprinting approaches ranging from a traditional view to the most recent deep learning (DL) based algorithms. Existing surveys have mostly focused on a constrained presentation of the wireless fingerprinting approaches, however, many aspects remain untold. In this work, however, we mitigate this by addressing every aspect - background on signal intelligence (SIGINT), applications, relevant DL algorithms, systematic literature review of RF fingerprinting techniques spanning the past two decades, discussion on datasets, and potential research avenues - necessary to elucidate this topic to the reader in an encyclopedic manner.
    Improved Topic modeling in Twitter through Community Pooling. (arXiv:2201.00690v1 [cs.IR])
    (0 min) Social networks play a fundamental role in propagation of information and news. Characterizing the content of the messages becomes vital for different tasks, like breaking news detection, personalized message recommendation, fake users detection, information flow characterization and others. However, Twitter posts are short and often less coherent than other text documents, which makes it challenging to apply text mining algorithms to these datasets efficiently. Tweet-pooling (aggregating tweets into longer documents) has been shown to improve automatic topic decomposition, but the performance achieved in this task varies depending on the pooling method. In this paper, we propose a new pooling scheme for topic modeling in Twitter, which groups tweets whose authors belong to the same community (group of users who mainly interact with each other but not with other groups) on a user interaction graph. We present a complete evaluation of this methodology, state of the art schemes and previous pooling models in terms of the cluster quality, document retrieval tasks performance and supervised machine learning classification score. Results show that our Community polling method outperformed other methods on the majority of metrics in two heterogeneous datasets, while also reducing the running time. This is useful when dealing with big amounts of noisy and short user-generated social media texts. Overall, our findings contribute to an improved methodology for identifying the latent topics in a Twitter dataset, without the need of modifying the basic machinery of a topic decomposition model.
    Learning cortical representations through perturbed and adversarial dreaming. (arXiv:2109.04261v2 [q-bio.NC] UPDATED)
    (0 min) Humans and other animals learn to extract general concepts from sensory experience without extensive teaching. This ability is thought to be facilitated by offline states like sleep where previous experiences are systemically replayed. However, the characteristic creative nature of dreams suggests that learning semantic representations may go beyond merely replaying previous experiences. We support this hypothesis by implementing a cortical architecture inspired by generative adversarial networks (GANs). Learning in our model is organized across three different global brain states mimicking wakefulness, NREM and REM sleep, optimizing different, but complementary objective functions. We train the model on standard datasets of natural images and evaluate the quality of the learned representations. Our results suggest that generating new, virtual sensory inputs via adversarial dreaming during REM sleep is essential for extracting semantic concepts, while replaying episodic memories via perturbed dreaming during NREM sleep improves the robustness of latent representations. The model provides a new computational perspective on sleep states, memory replay and dreams and suggests a cortical implementation of GANs.
    Game of GANs: Game-Theoretical Models for Generative Adversarial Networks. (arXiv:2106.06976v3 [cs.LG] UPDATED)
    (0 min) Generative Adversarial Networks (GANs) have recently attracted considerable attention in the AI community due to its ability to generate high-quality data of significant statistical resemblance to real data. Fundamentally, GAN is a game between two neural networks trained in an adversarial manner to reach a zero-sum Nash equilibrium profile. Despite the improvement accomplished in GANs in the last few years, several issues remain to be solved. This paper reviews the literature on the game theoretic aspects of GANs and addresses how game theory models can address specific challenges of generative model and improve the GAN's performance. We first present some preliminaries, including the basic GAN model and some game theory background. We then present taxonomy to classify state-of-the-art solutions into three main categories: modified game models, modified architectures, and modified learning methods. The classification is based on modifications made to the basic GAN model by proposed game-theoretic approaches in the literature. We then explore the objectives of each category and discuss recent works in each category. Finally, we discuss the remaining challenges in this field and present future research directions.
    Machine learning approaches for localized lockdown during COVID-19: a case study analysis. (arXiv:2201.00715v1 [cs.LG])
    (0 min) At the end of 2019, the latest novel coronavirus Sars-CoV-2 emerged as a significant acute respiratory disease that has become a global pandemic. Countries like Brazil have had difficulty in dealing with the virus due to the high socioeconomic difference of states and municipalities. Therefore, this study presents a new approach using different machine learning and deep learning algorithms applied to Brazilian COVID-19 data. First, a clustering algorithm is used to identify counties with similar sociodemographic behavior, while Benford's law is used to check for data manipulation. Based on these results we are able to correctly model SARIMA models based on the clusters to predict new daily cases. The unsupervised machine learning techniques optimized the process of defining the parameters of the SARIMA model. This framework can also be useful to propose confinement scenarios during the so-called second wave. We have used the 645 counties from S\~ao Paulo state, the most populous state in Brazil. However, this methodology can be used in other states or countries. This paper demonstrates how different techniques of machine learning, deep learning, data mining and statistics can be used together to produce important results when dealing with pandemic data. Although the findings cannot be used exclusively to assess and influence policy decisions, they offer an alternative to the ineffective measures that have been used.
    Improving Actor-Critic Reinforcement Learning via Hamiltonian Monte Carlo Method. (arXiv:2103.12020v3 [cs.LG] UPDATED)
    (0 min) The actor-critic RL is widely used in various robotic control tasks. By viewing the actor-critic RL from the perspective of variational inference (VI), the policy network is trained to obtain the approximate posterior of actions given the optimality criteria. However, in practice, the actor-critic RL may yield suboptimal policy estimates due to the amortization gap and insufficient exploration. In this work, inspired by the previous use of Hamiltonian Monte Carlo (HMC) in VI, we propose to integrate the policy network of actor-critic RL with HMC, which is termed as {\it Hamiltonian Policy}. As such we propose to evolve actions from the base policy according to HMC, and our proposed method has many benefits. First, HMC can improve the policy distribution to better approximate the posterior and hence reduce the amortization gap. Second, HMC can also guide the exploration more to the regions of action spaces with higher Q values, enhancing the exploration efficiency. Further, instead of directly applying HMC into RL, we propose a new leapfrog operator to simulate the Hamiltonian dynamics. Finally, in safe RL problems, we find that the proposed method can not only improve the achieved return, but also reduce safety constraint violations by discarding potentially unsafe actions. With comprehensive empirical experiments on continuous control baselines, including MuJoCo and PyBullet Roboschool, we show that the proposed approach is a data-efficient and easy-to-implement improvement over previous actor-critic methods.
    An Efficient and Accurate Rough Set for Feature Selection, Classification and Knowledge Representation. (arXiv:2201.00436v1 [cs.LG])
    (0 min) This paper present a strong data mining method based on rough set, which can realize feature selection, classification and knowledge representation at the same time. Rough set has good interpretability, and is a popular method for feature selections. But low efficiency and low accuracy are its main drawbacks that limits its application ability. In this paper,corresponding to the accuracy, we first find the ineffectiveness of rough set because of overfitting, especially in processing noise attribute, and propose a robust measurement for an attribute, called relative importance.we proposed the concept of "rough concept tree" for knowledge representation and classification. Experimental results on public benchmark data sets show that the proposed framework achieves higher accurcy than seven popular or the state-of-the-art feature selection methods.
    Optimal Sampling Gaps for Adaptive Submodular Maximization. (arXiv:2104.01750v4 [cs.LG] UPDATED)
    (0 min) Running machine learning algorithms on large and rapidly growing volumes of data is often computationally expensive, one common trick to reduce the size of a data set, and thus reduce the computational cost of machine learning algorithms, is \emph{probability sampling}. It creates a sampled data set by including each data point from the original data set with a known probability. Although the benefit of running machine learning algorithms on the reduced data set is obvious, one major concern is that the performance of the solution obtained from samples might be much worse than that of the optimal solution when using the full data set. In this paper, we examine the performance loss caused by probability sampling in the context of adaptive submodular maximization. We consider a simple probability sampling method which selects each data point with probability $r\in[0,1]$. If we set the sampling rate $r=1$, our problem reduces to finding a solution based on the original full data set. We define sampling gap as the largest ratio between the optimal solution obtained from the full data set and the optimal solution obtained from the samples, over independence systems. %It captures the performance loss of the optimal solution caused by the probability sampling. Our main contribution is to show that if the utility function is policywise submodular, then for a given sampling rate $r$, the sampling gap is both upper bounded and lower bounded by $1/r$. One immediate implication of our result is that if we can find an $\alpha$-approximation solution based on a sampled data set (which is sampled at sampling rate $r$), then this solution achieves an $\alpha r$ approximation ratio against the optimal solution when using the full data set.
    An Ensemble of Pre-trained Transformer Models For Imbalanced Multiclass Malware Classification. (arXiv:2112.13236v2 [cs.CR] UPDATED)
    (0 min) Classification of malware families is crucial for a comprehensive understanding of how they can infect devices, computers, or systems. Thus, malware identification enables security researchers and incident responders to take precautions against malware and accelerate mitigation. API call sequences made by malware are widely utilized features by machine and deep learning models for malware classification as these sequences represent the behavior of malware. However, traditional machine and deep learning models remain incapable of capturing sequence relationships between API calls. On the other hand, the transformer-based models process sequences as a whole and learn relationships between API calls due to multi-head attention mechanisms and positional embeddings. Our experiments demonstrate that the transformer model with one transformer block layer surpassed the widely used base architecture, LSTM. Moreover, BERT or CANINE, pre-trained transformer models, outperformed in classifying highly imbalanced malware families according to evaluation metrics, F1-score, and AUC score. Furthermore, the proposed bagging-based random transformer forest (RTF), an ensemble of BERT or CANINE, has reached the state-of-the-art evaluation scores on three out of four datasets, particularly state-of-the-art F1-score of 0.6149 on one of the commonly used benchmark dataset.
    Logic Shrinkage: Learned FPGA Netlist Sparsity for Efficient Neural Network Inference. (arXiv:2112.02346v2 [cs.LG] UPDATED)
    (0 min) FPGA-specific DNN architectures using the native LUTs as independently trainable inference operators have been shown to achieve favorable area-accuracy and energy-accuracy tradeoffs. The first work in this area, LUTNet, exhibited state-of-the-art performance for standard DNN benchmarks. In this paper, we propose the learned optimization of such LUT-based topologies, resulting in higher-efficiency designs than via the direct use of off-the-shelf, hand-designed networks. Existing implementations of this class of architecture require the manual specification of the number of inputs per LUT, K. Choosing appropriate K a priori is challenging, and doing so at even high granularity, e.g. per layer, is a time-consuming and error-prone process that leaves FPGAs' spatial flexibility underexploited. Furthermore, prior works see LUT inputs connected randomly, which does not guarantee a good choice of network topology. To address these issues, we propose logic shrinkage, a fine-grained netlist pruning methodology enabling K to be automatically learned for every LUT in a neural network targeted for FPGA inference. By removing LUT inputs determined to be of low importance, our method increases the efficiency of the resultant accelerators. Our GPU-friendly solution to LUT input removal is capable of processing large topologies during their training with negligible slowdown. With logic shrinkage, we better the area and energy efficiency of the best-performing LUTNet implementation of the CNV network classifying CIFAR-10 by 1.54x and 1.31x, respectively, while matching its accuracy. This implementation also reaches 2.71x the area efficiency of an equally accurate, heavily pruned BNN. On ImageNet with the Bi-Real Net architecture, employment of logic shrinkage results in a post-synthesis area reduction of 2.67x vs LUTNet, allowing for implementation that was previously impossible on today's largest FPGAs.
    Automatic Pharma News Categorization. (arXiv:2201.00688v1 [cs.IR])
    (0 min) We use a text dataset consisting of 23 news categories relevant to pharma information science, in order to compare the fine-tuning performance of multiple transformer models in a classification task. Using a well-balanced dataset with multiple autoregressive and autocoding transformation models, we compare their fine-tuning performance. To validate the winning approach, we perform diagnostics of model behavior on mispredicted instances, including inspection of category-wise metrics, evaluation of prediction certainty and assessment of latent space representations. Lastly, we propose an ensemble model consisting of the top performing individual predictors and demonstrate that this approach offers a modest improvement in the F1 metric.
    Conditional Selective Inference for Robust Regression and Outlier Detection using Piecewise-Linear Homotopy Continuation. (arXiv:2104.10840v2 [stat.ML] UPDATED)
    (0 min) In practical data analysis under noisy environment, it is common to first use robust methods to identify outliers, and then to conduct further analysis after removing the outliers. In this paper, we consider statistical inference of the model estimated after outliers are removed, which can be interpreted as a selective inference (SI) problem. To use conditional SI framework, it is necessary to characterize the events of how the robust method identifies outliers. Unfortunately, the existing methods cannot be directly used here because they are applicable to the case where the selection events can be represented by linear/quadratic constraints. In this paper, we propose a conditional SI method for popular robust regressions by using homotopy method. We show that the proposed conditional SI method is applicable to a wide class of robust regression and outlier detection methods and has good empirical performance on both synthetic data and real data experiments.
    Trends in Integration of Vision and Language Research: A Survey of Tasks, Datasets, and Methods. (arXiv:1907.09358v3 [cs.CV] UPDATED)
    (0 min) Interest in Artificial Intelligence (AI) and its applications has seen unprecedented growth in the last few years. This success can be partly attributed to the advancements made in the sub-fields of AI such as machine learning, computer vision, and natural language processing. Much of the growth in these fields has been made possible with deep learning, a sub-area of machine learning that uses artificial neural networks. This has created significant interest in the integration of vision and language. In this survey, we focus on ten prominent tasks that integrate language and vision by discussing their problem formulation, methods, existing datasets, evaluation measures, and compare the results obtained with corresponding state-of-the-art methods. Our efforts go beyond earlier surveys which are either task-specific or concentrate only on one type of visual content, i.e., image or video. Furthermore, we also provide some potential future directions in this field of research with an anticipation that this survey stimulates innovative thoughts and ideas to address the existing challenges and build new applications.
    Scalable semi-supervised dimensionality reduction with GPU-accelerated EmbedSOM. (arXiv:2201.00701v1 [cs.LG])
    (0 min) Dimensionality reduction methods have found vast application as visualization tools in diverse areas of science. Although many different methods exist, their performance is often insufficient for providing quick insight into many contemporary datasets, and the unsupervised mode of use prevents the users from utilizing the methods for dataset exploration and fine-tuning the details for improved visualization quality. We present BlosSOM, a high-performance semi-supervised dimensionality reduction software for interactive user-steerable visualization of high-dimensional datasets with millions of individual data points. BlosSOM builds on a GPU-accelerated implementation of the EmbedSOM algorithm, complemented by several landmark-based algorithms for interfacing the unsupervised model learning algorithms with the user supervision. We show the application of BlosSOM on realistic datasets, where it helps to produce high-quality visualizations that incorporate user-specified layout and focus on certain features. We believe the semi-supervised dimensionality reduction will improve the data visualization possibilities for science areas such as single-cell cytometry, and provide a fast and efficient base methodology for new directions in dataset exploration and annotation.
    A Family of Deep Learning Architectures for Channel Estimation and Hybrid Beamforming in Multi-Carrier mm-Wave Massive MIMO. (arXiv:1912.10036v6 [eess.SP] UPDATED)
    (0 min) Hybrid analog and digital beamforming transceivers are instrumental in addressing the challenge of expensive hardware and high training overheads in the next generation millimeter-wave (mm-Wave) massive MIMO (multiple-input multiple-output) systems. However, lack of fully digital beamforming in hybrid architectures and short coherence times at mm-Wave impose additional constraints on the channel estimation. Prior works on addressing these challenges have focused largely on narrowband channels wherein optimization-based or greedy algorithms were employed to derive hybrid beamformers. In this paper, we introduce a deep learning (DL) approach for channel estimation and hybrid beamforming for frequency-selective, wideband mm-Wave systems. In particular, we consider a massive MIMO Orthogonal Frequency Division Multiplexing (MIMO-OFDM) system and propose three different DL frameworks comprising convolutional neural networks (CNNs), which accept the raw data of received signal as input and yield channel estimates and the hybrid beamformers at the output. We also introduce both offline and online prediction schemes. Numerical experiments demonstrate that, compared to the current state-of-the-art optimization and DL methods, our approach provides higher spectral efficiency, lesser computational cost and fewer number of pilot signals, and higher tolerance against the deviations in the received pilot data, corrupted channel matrix, and propagation environment.
    Convergence of Random Reshuffling Under The Kurdyka-{\L}ojasiewicz Inequality. (arXiv:2110.04926v2 [math.OC] UPDATED)
    (0 min) We study the random reshuffling (RR) method for smooth nonconvex optimization problems with a finite-sum structure. Though this method is widely utilized in practice such as the training of neural networks, its convergence behavior is only understood in several limited settings. In this paper, under the well-known Kurdyka-Lojasiewicz (KL) inequality, we establish strong limit-point convergence results for RR with appropriate diminishing step sizes, namely, the whole sequence of iterates generated by RR is convergent and converges to a single stationary point in an almost sure sense. In addition, we derive the corresponding rate of convergence, depending on the KL exponent and the suitably selected diminishing step sizes. When the KL exponent lies in $[0,\frac12]$, the convergence is at a rate of $\mathcal{O}(t^{-1})$ with $t$ counting the iteration number. When the KL exponent belongs to $(\frac12,1)$, our derived convergence rate is of the form $\mathcal{O}(t^{-q})$ with $q\in (0,1)$ depending on the KL exponent. The standard KL inequality-based convergence analysis framework only applies to algorithms with a certain descent property. We conduct a novel convergence analysis for the non-descent RR method with diminishing step sizes based on the KL inequality, which generalizes the standard KL framework. We summarize our main steps and core ideas in an informal analysis framework, which is of independent interest. As a direct application of this framework, we also establish similar strong limit-point convergence results for the reshuffled proximal point method.
    Boosting Contrastive Self-Supervised Learning with False Negative Cancellation. (arXiv:2011.11765v2 [cs.CV] UPDATED)
    (0 min) Self-supervised representation learning has made significant leaps fueled by progress in contrastive learning, which seeks to learn transformations that embed positive input pairs nearby, while pushing negative pairs far apart. While positive pairs can be generated reliably (e.g., as different views of the same image), it is difficult to accurately establish negative pairs, defined as samples from different images regardless of their semantic content or visual features. A fundamental problem in contrastive learning is mitigating the effects of false negatives. Contrasting false negatives induces two critical issues in representation learning: discarding semantic information and slow convergence. In this paper, we propose novel approaches to identify false negatives, as well as two strategies to mitigate their effect, i.e. false negative elimination and attraction, while systematically performing rigorous evaluations to study this problem in detail. Our method exhibits consistent improvements over existing contrastive learning-based methods. Without labels, we identify false negatives with 40% accuracy among 1000 semantic classes on ImageNet, and achieve 5.8% absolute improvement in top-1 accuracy over the previous state-of-the-art when finetuning with 1% labels. Our code is available at https://github.com/google-research/fnc.
    Deep Learning Interviews: Hundreds of fully solved job interview questions from a wide range of key topics in AI. (arXiv:2201.00650v1 [cs.LG])
    (0 min) The second edition of Deep Learning Interviews is home to hundreds of fully-solved problems, from a wide range of key topics in AI. It is designed to both rehearse interview or exam specific topics and provide machine learning M.Sc./Ph.D. students, and those awaiting an interview a well-organized overview of the field. The problems it poses are tough enough to cut your teeth on and to dramatically improve your skills-but they're framed within thought-provoking questions and engaging stories. That is what makes the volume so specifically valuable to students and job seekers: it provides them with the ability to speak confidently and quickly on any relevant topic, to answer technical questions clearly and correctly, and to fully understand the purpose and meaning of interview questions and answers. Those are powerful, indispensable advantages to have when walking into the interview room. The book's contents is a large inventory of numerous topics relevant to DL job interviews and graduate level exams. That places this work at the forefront of the growing trend in science to teach a core set of practical mathematical and computational skills. It is widely accepted that the training of every computer scientist must include the fundamental theorems of ML, and AI appears in the curriculum of nearly every university. This volume is designed as an excellent reference for graduates of such programs.
    Boosting the Certified Robustness of L-infinity Distance Nets. (arXiv:2110.06850v3 [cs.LG] UPDATED)
    (0 min) Recently, Zhang et al.(2021) developed a new neural network architecture based on $\ell_\infty$-distance functions, which naturally possesses certified $\ell_\infty$ robustness by its construction. Despite rigorous theoretical guarantees, the model so far can only achieve comparable performance to conventional networks. In this paper, we make the following two contributions: $\mathrm{(i)}$ We demonstrate that $\ell_\infty$-distance nets enjoy a fundamental advantage in certified robustness over conventional networks (under typical certification approaches); $\mathrm{(ii)}$ With an improved training process we are able to significantly boost the certified accuracy of $\ell_\infty$-distance nets. Our training approach largely alleviates the optimization problem that arose in the previous training scheme, in particular, the unexpected large Lipschitz constant due to the use of a crucial trick called $\ell_p$-relaxation. The core of our training approach is a novel objective function that combines scaled cross-entropy loss and clipped hinge loss with a decaying mixing coefficient. Experiments show that using the proposed training strategy, the certified accuracy of $\ell_\infty$-distance net can be dramatically improved from 33.30% to 40.06% on CIFAR-10 ($\epsilon=8/255$), meanwhile outperforming other approaches in this area by a large margin. Our results clearly demonstrate the effectiveness and potential of $\ell_\infty$-distance net for certified robustness. Codes are available at https://github.com/zbh2047/L_inf-dist-net-v2.
    Resource-Efficient and Delay-Aware Federated Learning Design under Edge Heterogeneity. (arXiv:2112.13926v2 [cs.NI] UPDATED)
    (0 min) Federated learning (FL) has emerged as a popular methodology for distributing machine learning across wireless edge devices. In this work, we consider optimizing the tradeoff between model performance and resource utilization in FL, under device-server communication delays and device computation heterogeneity. Our proposed StoFedDelAv algorithm incorporates a local-global model combiner into the FL synchronization step. We theoretically characterize the convergence behavior of StoFedDelAv and obtain the optimal combiner weights, which consider the global model delay and expected local gradient error at each device. We then formulate a network-aware optimization problem which tunes the minibatch sizes of the devices to jointly minimize energy consumption and machine learning training loss, and solve the non-convex problem through a series of convex approximations. Our simulations reveal that StoFedDelAv outperforms the current art in FL in terms of model convergence speed and network resource utilization when the minibatch size and the combiner weights are adjusted. Additionally, our method can reduce the number of uplink communication rounds required during the model training period to reach the same accuracy.
    AugMax: Adversarial Composition of Random Augmentations for Robust Training. (arXiv:2110.13771v3 [cs.CV] UPDATED)
    (0 min) Data augmentation is a simple yet effective way to improve the robustness of deep neural networks (DNNs). Diversity and hardness are two complementary dimensions of data augmentation to achieve robustness. For example, AugMix explores random compositions of a diverse set of augmentations to enhance broader coverage, while adversarial training generates adversarially hard samples to spot the weakness. Motivated by this, we propose a data augmentation framework, termed AugMax, to unify the two aspects of diversity and hardness. AugMax first randomly samples multiple augmentation operators and then learns an adversarial mixture of the selected operators. Being a stronger form of data augmentation, AugMax leads to a significantly augmented input distribution which makes model training more challenging. To solve this problem, we further design a disentangled normalization module, termed DuBIN (Dual-Batch-and-Instance Normalization), that disentangles the instance-wise feature heterogeneity arising from AugMax. Experiments show that AugMax-DuBIN leads to significantly improved out-of-distribution robustness, outperforming prior arts by 3.03%, 3.49%, 1.82% and 0.71% on CIFAR10-C, CIFAR100-C, Tiny ImageNet-C and ImageNet-C. Codes and pretrained models are available: https://github.com/VITA-Group/AugMax.
    Parkour Spot ID: Feature Matching in Satellite and Street view images using Deep Learning. (arXiv:2201.00377v1 [cs.CV])
    (0 min) How to find places that are not indexed by Google Maps? We propose an intuitive method and framework to locate places based on their distinctive spatial features. The method uses satellite and street view images in machine vision approaches to classify locations. If we can classify locations, we just need to repeat for non-overlapping locations in our area of interest. We assess the proposed system in finding Parkour spots in the campus of Arizona State University. The results are very satisfactory, having found more than 25 new Parkour spots, with a rate of true positives above 60%.
    Model-free Neural Counterfactual Regret Minimization with Bootstrap Learning. (arXiv:2012.01870v3 [cs.LG] UPDATED)
    (0 min) Counterfactual Regret Minimization (CFR) has achieved many fascinating results in solving large-scale Imperfect Information Games (IIGs). Neural network approximation CFR (neural CFR) is one of the promising techniques that can reduce computation and memory consumption by generalizing decision information between similar states. Current neural CFR algorithms have to approximate cumulative regrets. However, efficient and accurate approximation in a large-scale IIG is still a tough challenge. In this paper, a new CFR variant, Recursive CFR (ReCFR), is proposed. In ReCFR, Recursive Substitute Values (RSVs) are learned and used to replace cumulative regrets. It is proven that ReCFR can converge to a Nash equilibrium at a rate of $O(\frac{1}{\sqrt{T}})$. Based on ReCFR, a new model-free neural CFR with bootstrap learning, Neural ReCFR-B, is proposed. Due to the recursive and non-cumulative nature of RSVs, Neural ReCFR-B has lower-variance training targets than other neural CFRs. Experimental results show that Neural ReCFR-B is competitive with the state-of-the-art neural CFR algorithms at a much lower training cost.
    MHATC: Autism Spectrum Disorder identification utilizing multi-head attention encoder along with temporal consolidation modules. (arXiv:2201.00404v1 [q-bio.NC])
    (0 min) Resting-state fMRI is commonly used for diagnosing Autism Spectrum Disorder (ASD) by using network-based functional connectivity. It has been shown that ASD is associated with brain regions and their inter-connections. However, discriminating based on connectivity patterns among imaging data of the control population and that of ASD patients' brains is a non-trivial task. In order to tackle said classification task, we propose a novel deep learning architecture (MHATC) consisting of multi-head attention and temporal consolidation modules for classifying an individual as a patient of ASD. The devised architecture results from an in-depth analysis of the limitations of current deep neural network solutions for similar applications. Our approach is not only robust but computationally efficient, which can allow its adoption in a variety of other research and clinical settings.
    MORAL: Aligning AI with Human Norms through Multi-Objective Reinforced Active Learning. (arXiv:2201.00012v1 [cs.LG])
    (0 min) Inferring reward functions from demonstrations and pairwise preferences are auspicious approaches for aligning Reinforcement Learning (RL) agents with human intentions. However, state-of-the art methods typically focus on learning a single reward model, thus rendering it difficult to trade off different reward functions from multiple experts. We propose Multi-Objective Reinforced Active Learning (MORAL), a novel method for combining diverse demonstrations of social norms into a Pareto-optimal policy. Through maintaining a distribution over scalarization weights, our approach is able to interactively tune a deep RL agent towards a variety of preferences, while eliminating the need for computing multiple policies. We empirically demonstrate the effectiveness of MORAL in two scenarios, which model a delivery and an emergency task that require an agent to act in the presence of normative conflicts. Overall, we consider our research a step towards multi-objective RL with learned rewards, bridging the gap between current reward learning and machine ethics literature.
    maskGRU: Tracking Small Objects in the Presence of Large Background Motions. (arXiv:2201.00467v1 [cs.CV])
    (0 min) We propose a recurrent neural network-based spatio-temporal framework named maskGRU for the detection and tracking of small objects in videos. While there have been many developments in the area of object tracking in recent years, tracking a small moving object amid other moving objects and actors (such as a ball amid moving players in sports footage) continues to be a difficult task. Existing spatio-temporal networks, such as convolutional Gated Recurrent Units (convGRUs), are difficult to train and have trouble accurately tracking small objects under such conditions. To overcome these difficulties, we developed the maskGRU framework that uses a weighted sum of the internal hidden state produced by a convGRU and a 3-channel mask of the tracked object's predicted bounding box as the hidden state to be used at the next time step of the underlying convGRU. We believe the technique of incorporating a mask into the hidden state through a weighted sum has two benefits: controlling the effect of exploding gradients and introducing an attention-like mechanism into the network by indicating where in the previous video frame the object is located. Our experiments show that maskGRU outperforms convGRU at tracking objects that are small relative to the video resolution even in the presence of other moving objects.
    Transformer Embeddings of Irregularly Spaced Events and Their Participants. (arXiv:2201.00044v1 [cs.LG])
    (0 min) We propose an approach to modeling irregularly spaced sequences of discrete events. We begin with a continuous-time variant of the Transformer, which was originally formulated (Vaswani et al., 2017) for sequences without timestamps. We embed a possible event (or other boolean fact) at time $t$ by using attention over the events that occurred at times $< t$ (and the facts that were true when they occurred). We control this attention using pattern-matching logic rules that relate events and facts that share participants. These rules determine which previous events will be attended to, as well as how to transform the embeddings of the events and facts into the attentional queries, keys, and values. Other logic rules describe how to change the set of facts in response to events. Our approach closely follows Mei et al. (2020a), and adopts their Datalog Through Time formalism for logic rules. As in that work, a domain expert first writes a set of logic rules that establishes the set of possible events and other facts at each time $t$. Each possible event or other fact is embedded using a neural architecture that is derived from the rules that established it. Our only difference from Mei et al. (2020a) is that we derive a flatter, attention-based neural architecture whereas they used a more serial LSTM architecture. We find that our attention-based approach performs about equally well on the RoboCup dataset, where the logic rules play an important role in improving performance. We also compared these two methods with two previous attention-based methods (Zuo et al., 2020; Zhang et al., 2020a) on simpler synthetic and real domains without logic rules, and found our proposed approach to be at least as good, and sometimes better, than each of the other three methods.
    The DONUT Approach to EnsembleCombination Forecasting. (arXiv:2201.00426v1 [cs.LG])
    (0 min) This paper presents an ensemble forecasting method that shows strong results on the M4Competition dataset by decreasing feature and model selection assumptions, termed DONUT(DO Not UTilize human assumptions). Our assumption reductions, consisting mainly of auto-generated features and a more diverse model pool for the ensemble, significantly outperforms the statistical-feature-based ensemble method FFORMA by Montero-Manso et al. (2020). Furthermore, we investigate feature extraction with a Long short-term memory Network(LSTM) Autoencoder and find that such features contain crucial information not captured by traditional statistical feature approaches. The ensemble weighting model uses both LSTM features and statistical features to combine the models accurately. Analysis of feature importance and interaction show a slight superiority for LSTM features over the statistical ones alone. Clustering analysis shows that different essential LSTM features are different from most statistical features and each other. We also find that increasing the solution space of the weighting model by augmenting the ensemble with new models is something the weighting model learns to use, explaining part of the accuracy gains. Lastly, we present a formal ex-post-facto analysis of optimal combination and selection for ensembles, quantifying differences through linear optimization on the M4 dataset. We also include a short proof that model combination is superior to model selection, a posteriori.
    Have I done enough planning or should I plan more?. (arXiv:2201.00764v1 [cs.AI])
    (0 min) People's decisions about how to allocate their limited computational resources are essential to human intelligence. An important component of this metacognitive ability is deciding whether to continue thinking about what to do and move on to the next decision. Here, we show that people acquire this ability through learning and reverse-engineer the underlying learning mechanisms. Using a process-tracing paradigm that externalises human planning, we find that people quickly adapt how much planning they perform to the cost and benefit of planning. To discover the underlying metacognitive learning mechanisms we augmented a set of reinforcement learning models with metacognitive features and performed Bayesian model selection. Our results suggest that the metacognitive ability to adjust the amount of planning might be learned through a policy-gradient mechanism that is guided by metacognitive pseudo-rewards that communicate the value of planning.
    Confidence-Aware Multi-Teacher Knowledge Distillation. (arXiv:2201.00007v1 [cs.LG])
    (0 min) Knowledge distillation is initially introduced to utilize additional supervision from a single teacher model for the student model training. To boost the student performance, some recent variants attempt to exploit diverse knowledge sources from multiple teachers. However, existing studies mainly integrate knowledge from diverse sources by averaging over multiple teacher predictions or combining them using other various label-free strategies, which may mislead student in the presence of low-quality teacher predictions. To tackle this problem, we propose Confidence-Aware Multi-teacher Knowledge Distillation (CA-MKD), which adaptively assigns sample-wise reliability for each teacher prediction with the help of ground-truth labels, with those teacher predictions close to one-hot labels assigned large weights. Besides, CA-MKD incorporates intermediate layers to further improve student performance. Extensive experiments show that our CA-MKD consistently outperforms all compared state-of-the-art methods across various teacher-student architectures.
    MLOps -- Definitions, Tools and Challenges. (arXiv:2201.00162v1 [cs.LG])
    (0 min) This paper is an overview of the Machine Learning Operations (MLOps) area. Our aim is to define the operation and the components of such systems by highlighting the current problems and trends. In this context, we present the different tools and their usefulness in order to provide the corresponding guidelines. Moreover, the connection between MLOps and AutoML (Automated Machine Learning) is identified and how this combination could work is proposed.
    PowerGraph: Using neural networks and principal components to multivariate statistical power trade-offs. (arXiv:2201.00719v1 [stat.ME])
    (0 min) It is increasingly acknowledged that a priori statistical power estimation for planned studies with multiple model parameters is inherently a multivariate problem. Power for individual parameters of interest cannot be reliably estimated univariately because sampling variably in, correlation with, and variance explained relative to one parameter will impact the power for another parameter, all usual univariate considerations being equal. Explicit solutions in such cases, especially for models with many parameters, are either impractical or impossible to solve, leaving researchers with the prevailing method of simulating power. However, point estimates for a vector of model parameters are uncertain, and the impact of inaccuracy is unknown. In such cases, sensitivity analysis is recommended such that multiple combinations of possible observable parameter vectors are simulated to understand power trade-offs. A limitation to this approach is that it is computationally expensive to generate sufficient sensitivity combinations to accurately map the power trade-off function in increasingly high dimensional spaces for the models that social scientists estimate. This paper explores the efficient estimation and graphing of statistical power for a study over varying model parameter combinations. Optimally powering a study is crucial to ensure a minimum probability of finding the hypothesized effect. We first demonstrate the impact of varying parameter values on power for specific hypotheses of interest and quantify the computational intensity of computing such a graph for a given level of precision. Finally, we propose a simple and generalizable machine learning inspired solution to cut the computational cost to less than 7\% of what could be called a brute force approach. [abridged]
    Transfer RL across Observation Feature Spaces via Model-Based Regularization. (arXiv:2201.00248v1 [cs.LG])
    (0 min) In many reinforcement learning (RL) applications, the observation space is specified by human developers and restricted by physical realizations, and may thus be subject to dramatic changes over time (e.g. increased number of observable features). However, when the observation space changes, the previous policy will likely fail due to the mismatch of input features, and another policy must be trained from scratch, which is inefficient in terms of computation and sample complexity. Following theoretical insights, we propose a novel algorithm which extracts the latent-space dynamics in the source task, and transfers the dynamics model to the target task to use as a model-based regularizer. Our algorithm works for drastic changes of observation space (e.g. from vector-based observation to image-based observation), without any inter-task mapping or any prior knowledge of the target task. Empirical results show that our algorithm significantly improves the efficiency and stability of learning in the target task.
    Swift and Sure: Hardness-aware Contrastive Learning for Low-dimensional Knowledge Graph Embeddings. (arXiv:2201.00565v1 [cs.LG])
    (0 min) Knowledge graph embedding (KGE) has drawn great attention due to its potential in automatic knowledge graph (KG) completion and knowledge-driven tasks. However, recent KGE models suffer from high training cost and large storage space, thus limiting their practicality in real-world applications. To address this challenge, based on the latest findings in the field of Contrastive Learning, we propose a novel KGE training framework called Hardness-aware Low-dimensional Embedding (HaLE). Instead of the traditional Negative Sampling, we design a new loss function based on query sampling that can balance two important training targets, Alignment and Uniformity. Furthermore, we analyze the hardness-aware ability of recent low-dimensional hyperbolic models and propose a lightweight hardness-aware activation mechanism, which can help the KGE models focus on hard instances and speed up convergence. The experimental results show that in the limited training time, HaLE can effectively improve the performance and training speed of KGE models on five commonly-used datasets. The HaLE-trained models can obtain a high prediction accuracy after training few minutes and are competitive compared to the state-of-the-art models in both low- and high-dimensional conditions.
    Lung-Originated Tumor Segmentation from Computed Tomography Scan (LOTUS) Benchmark. (arXiv:2201.00458v1 [eess.IV])
    (0 min) Lung cancer is one of the deadliest cancers, and in part its effective diagnosis and treatment depend on the accurate delineation of the tumor. Human-centered segmentation, which is currently the most common approach, is subject to inter-observer variability, and is also time-consuming, considering the fact that only experts are capable of providing annotations. Automatic and semi-automatic tumor segmentation methods have recently shown promising results. However, as different researchers have validated their algorithms using various datasets and performance metrics, reliably evaluating these methods is still an open challenge. The goal of the Lung-Originated Tumor Segmentation from Computed Tomography Scan (LOTUS) Benchmark created through 2018 IEEE Video and Image Processing (VIP) Cup competition, is to provide a unique dataset and pre-defined metrics, so that different researchers can develop and evaluate their methods in a unified fashion. The 2018 VIP Cup started with a global engagement from 42 countries to access the competition data. At the registration stage, there were 129 members clustered into 28 teams from 10 countries, out of which 9 teams made it to the final stage and 6 teams successfully completed all the required tasks. In a nutshell, all the algorithms proposed during the competition, are based on deep learning models combined with a false positive reduction technique. Methods developed by the three finalists show promising results in tumor segmentation, however, more effort should be put into reducing the false positive rate. This competition manuscript presents an overview of the VIP-Cup challenge, along with the proposed algorithms and results.
    Neural network training under semidefinite constraints. (arXiv:2201.00632v1 [cs.LG])
    (0 min) This paper is concerned with the training of neural networks (NNs) under semidefinite constraints. This type of training problems has recently gained popularity since semidefinite constraints can be used to verify interesting properties for NNs that include, e.g., the estimation of an upper bound on the Lipschitz constant, which relates to the robustness of an NN, or the stability of dynamic systems with NN controllers. The utilized semidefinite constraints are based on sector constraints satisfied by the underlying activation functions. Unfortunately, one of the biggest bottlenecks of these new results is the required computational effort for incorporating the semidefinite constraints into the training of NNs which is limiting their scalability to large NNs. We address this challenge by developing interior point methods for NN training that we implement using barrier functions for semidefinite constraints. In order to efficiently compute the gradients of the barrier terms, we exploit the structure of the semidefinite constraints. In experiments, we demonstrate the superior efficiency of our training method over previous approaches, which allows us, e.g., to use semidefinite constraints in the training of Wasserstein generative adversarial networks, where the discriminator must satisfy a Lipschitz condition.
    An Efficient Combinatorial Optimization Model Using Learning-to-Rank Distillation. (arXiv:2201.00695v1 [cs.IR])
    (0 min) Recently, deep reinforcement learning (RL) has proven its feasibility in solving combinatorial optimization problems (COPs). The learning-to-rank techniques have been studied in the field of information retrieval. While several COPs can be formulated as the prioritization of input items, as is common in the information retrieval, it has not been fully explored how the learning-to-rank techniques can be incorporated into deep RL for COPs. In this paper, we present the learning-to-rank distillation-based COP framework, where a high-performance ranking policy obtained by RL for a COP can be distilled into a non-iterative, simple model, thereby achieving a low-latency COP solver. Specifically, we employ the approximated ranking distillation to render a score-based ranking model learnable via gradient descent. Furthermore, we use the efficient sequence sampling to improve the inference performance with a limited delay. With the framework, we demonstrate that a distilled model not only achieves comparable performance to its respective, high-performance RL, but also provides several times faster inferences. We evaluate the framework with several COPs such as priority-based task scheduling and multidimensional knapsack, demonstrating the benefits of the framework in terms of inference latency and performance.
    A Lightweight and Accurate Spatial-Temporal Transformer for Traffic Forecasting. (arXiv:2201.00008v1 [cs.LG])
    (0 min) We study the forecasting problem for traffic with dynamic, possibly periodical, and joint spatial-temporal dependency between regions. Given the aggregated inflow and outflow traffic of regions in a city from time slots 0 to t-1, we predict the traffic at time t at any region. Prior arts in the area often consider the spatial and temporal dependencies in a decoupled manner or are rather computationally intensive in training with a large number of hyper-parameters to tune. We propose ST-TIS, a novel, lightweight, and accurate Spatial-Temporal Transformer with information fusion and region sampling for traffic forecasting. ST-TIS extends the canonical Transformer with information fusion and region sampling. The information fusion module captures the complex spatial-temporal dependency between regions. The region sampling module is to improve the efficiency and prediction accuracy, cutting the computation complexity for dependency learning from $O(n^2)$ to $O(n\sqrt{n})$, where n is the number of regions. With far fewer parameters than state-of-the-art models, the offline training of our model is significantly faster in terms of tuning and computation (with a reduction of up to $90\%$ on training time and network parameters). Notwithstanding such training efficiency, extensive experiments show that ST-TIS is substantially more accurate in online prediction than state-of-the-art approaches (with an average improvement of up to $11\%$ on RMSE, $14\%$ on MAPE).
    Improving Out-of-Distribution Robustness via Selective Augmentation. (arXiv:2201.00299v1 [cs.LG])
    (0 min) Machine learning algorithms typically assume that training and test examples are drawn from the same distribution. However, distribution shift is a common problem in real-world applications and can cause models to perform dramatically worse at test time. In this paper, we specifically consider the problems of domain shifts and subpopulation shifts (eg. imbalanced data). While prior works often seek to explicitly regularize internal representations and predictors of the model to be domain invariant, we instead aim to regularize the whole function without restricting the model's internal representations. This leads to a simple mixup-based technique which learns invariant functions via selective augmentation called LISA. LISA selectively interpolates samples either with the same labels but different domains or with the same domain but different labels. We analyze a linear setting and theoretically show how LISA leads to a smaller worst-group error. Empirically, we study the effectiveness of LISA on nine benchmarks ranging from subpopulation shifts to domain shifts, and we find that LISA consistently outperforms other state-of-the-art methods.
    Deep-learning-based upscaling method for geologic models via theory-guided convolutional neural network. (arXiv:2201.00698v1 [cs.LG])
    (0 min) Large-scale or high-resolution geologic models usually comprise a huge number of grid blocks, which can be computationally demanding and time-consuming to solve with numerical simulators. Therefore, it is advantageous to upscale geologic models (e.g., hydraulic conductivity) from fine-scale (high-resolution grids) to coarse-scale systems. Numerical upscaling methods have been proven to be effective and robust for coarsening geologic models, but their efficiency remains to be improved. In this work, a deep-learning-based method is proposed to upscale the fine-scale geologic models, which can assist to improve upscaling efficiency significantly. In the deep learning method, a deep convolutional neural network (CNN) is trained to approximate the relationship between the coarse grid of hydraulic conductivity fields and the hydraulic heads, which can then be utilized to replace the numerical solvers while solving the flow equations for each coarse block. In addition, physical laws (e.g., governing equations and periodic boundary conditions) can also be incorporated into the training process of the deep CNN model, which is termed the theory-guided convolutional neural network (TgCNN). With the physical information considered, dependence on the data volume of training the deep learning models can be reduced greatly. Several subsurface flow cases are introduced to test the performance of the proposed deep-learning-based upscaling method, including 2D and 3D cases, and isotropic and anisotropic cases. The results show that the deep learning method can provide equivalent upscaling accuracy to the numerical method, and efficiency can be improved significantly compared to numerical upscaling.
    Representation Topology Divergence: A Method for Comparing Neural Network Representations. (arXiv:2201.00058v1 [cs.LG])
    (0 min) Comparison of data representations is a complex multi-aspect problem that has not enjoyed a complete solution yet. We propose a method for comparing two data representations. We introduce the Representation Topology Divergence (RTD), measuring the dissimilarity in multi-scale topology between two point clouds of equal size with a one-to-one correspondence between points. The data point clouds are allowed to lie in different ambient spaces. The RTD is one of the few TDA-based practical methods applicable to real machine learning datasets. Experiments show that the proposed RTD agrees with the intuitive assessment of data representation similarity and is sensitive to its topological structure. We apply RTD to gain insights on neural networks representations in computer vision and NLP domains for various problems: training dynamics analysis, data distribution shift, transfer learning, ensemble learning, disentanglement assessment.
    Faster Unbalanced Optimal Transport: Translation invariant Sinkhorn and 1-D Frank-Wolfe. (arXiv:2201.00730v1 [math.OC])
    (0 min) Unbalanced optimal transport (UOT) extends optimal transport (OT) to take into account mass variations to compare distributions. This is crucial to make OT successful in ML applications, making it robust to data normalization and outliers. The baseline algorithm is Sinkhorn, but its convergence speed might be significantly slower for UOT than for OT. In this work, we identify the cause for this deficiency, namely the lack of a global normalization of the iterates, which equivalently corresponds to a translation of the dual OT potentials. Our first contribution leverages this idea to develop a provably accelerated Sinkhorn algorithm (coined 'translation invariant Sinkhorn') for UOT, bridging the computational gap with OT. Our second contribution focusses on 1-D UOT and proposes a Frank-Wolfe solver applied to this translation invariant formulation. The linear oracle of each steps amounts to solving a 1-D OT problems, resulting in a linear time complexity per iteration. Our last contribution extends this method to the computation of UOT barycenter of 1-D measures. Numerical simulations showcase the convergence speed improvement brought by these three approaches.
    Hands-on Bayesian Neural Networks -- a Tutorial for Deep Learning Users. (arXiv:2007.06823v3 [cs.LG] UPDATED)
    (0 min) Modern deep learning methods constitute incredibly powerful tools to tackle a myriad of challenging problems. However, since deep learning methods operate as black boxes, the uncertainty associated with their predictions is often challenging to quantify. Bayesian statistics offer a formalism to understand and quantify the uncertainty associated with deep neural network predictions. This tutorial provides an overview of the relevant literature and a complete toolset to design, implement, train, use and evaluate Bayesian Neural Networks, i.e. Stochastic Artificial Neural Networks trained using Bayesian methods.
    Linear Classifiers in Product Space Forms. (arXiv:2102.10204v3 [cs.LG] UPDATED)
    (0 min) Embedding methods for product spaces are powerful techniques for low-distortion and low-dimensional representation of complex data structures. Here, we address the new problem of linear classification in product space forms -- products of Euclidean, spherical, and hyperbolic spaces. First, we describe novel formulations for linear classifiers on a Riemannian manifold using geodesics and Riemannian metrics which generalize straight lines and inner products in vector spaces. Second, we prove that linear classifiers in $d$-dimensional space forms of any curvature have the same expressive power, i.e., they can shatter exactly $d+1$ points. Third, we formalize linear classifiers in product space forms, describe the first known perceptron and support vector machine classifiers for such spaces and establish rigorous convergence results for perceptrons. Moreover, we prove that the Vapnik-Chervonenkis dimension of linear classifiers in a product space form of dimension $d$ is \emph{at least} $d+1$. We support our theoretical findings with simulations on several datasets, including synthetic data, image data, and single-cell RNA sequencing (scRNA-seq) data. The results show that classification in low-dimensional product space forms for scRNA-seq data offers, on average, a performance improvement of $\sim15\%$ when compared to that in Euclidean spaces of the same dimension.
    Self-attention Multi-view Representation Learning with Diversity-promoting Complementarity. (arXiv:2201.00168v1 [cs.LG])
    (0 min) Multi-view learning attempts to generate a model with a better performance by exploiting the consensus and/or complementarity among multi-view data. However, in terms of complementarity, most existing approaches only can find representations with single complementarity rather than complementary information with diversity. In this paper, to utilize both complementarity and consistency simultaneously, give free rein to the potential of deep learning in grasping diversity-promoting complementarity for multi-view representation learning, we propose a novel supervised multi-view representation learning algorithm, called Self-Attention Multi-View network with Diversity-Promoting Complementarity (SAMVDPC), which exploits the consistency by a group of encoders, uses self-attention to find complementary information entailing diversity. Extensive experiments conducted on eight real-world datasets have demonstrated the effectiveness of our proposed method, and show its superiority over several baseline methods, which only consider single complementary information.
    Layer-Wise Multi-View Decoding for Improved Natural Language Generation. (arXiv:2005.08081v6 [cs.CL] UPDATED)
    (0 min) In sequence-to-sequence learning, e.g., natural language generation, the decoder relies on the attention mechanism to efficiently extract information from the encoder. While it is common practice to draw information from only the last encoder layer, recent work has proposed to use representations from different encoder layers for diversified levels of information. Nonetheless, the decoder still obtains only a single view of the source sequences, which might lead to insufficient training of the encoder layer stack due to the hierarchy bypassing problem. In this work, we propose layer-wise multi-view decoding, where for each decoder layer, together with the representations from the last encoder layer, which serve as a global view, those from other encoder layers are supplemented for a stereoscopic view of the source sequences. Systematic experiments and analyses show that we successfully address the hierarchy bypassing problem, require almost negligible parameter increase, and substantially improve the performance of sequence-to-sequence learning with deep representations on five diverse tasks, i.e., machine translation, abstractive summarization, image captioning, video captioning, and medical report generation. In particular, our approach achieves new state-of-the-art results on eight benchmark datasets, including a low-resource machine translation dataset and two low-resource medical report generation datasets.
    Descriptors for Machine Learning Model of Generalized Force Field in Condensed Matter Systems. (arXiv:2201.00798v1 [cond-mat.str-el])
    (0 min) We outline the general framework of machine learning (ML) methods for multi-scale dynamical modeling of condensed matter systems, and in particular of strongly correlated electron models. Complex spatial temporal behaviors in these systems often arise from the interplay between quasi-particles and the emergent dynamical classical degrees of freedom, such as local lattice distortions, spins, and order-parameters. Central to the proposed framework is the ML energy model that, by successfully emulating the time-consuming electronic structure calculation, can accurately predict a local energy based on the classical field in the intermediate neighborhood. In order to properly include the symmetry of the electron Hamiltonian, a crucial component of the ML energy model is the descriptor that transforms the neighborhood configuration into invariant feature variables, which are input to the learning model. A general theory of the descriptor for the classical fields is formulated, and two types of models are distinguished depending on the presence or absence of an internal symmetry for the classical field. Several specific approaches to the descriptor of the classical fields are presented. Our focus is on the group-theoretical method that offers a systematic and rigorous approach to compute invariants based on the bispectrum coefficients. We propose an efficient implementation of the bispectrum method based on the concept of reference irreducible representations. Finally, the implementations of the various descriptors are demonstrated on well-known electronic lattice models.
    FIFA ranking: Evaluation and path forward. (arXiv:2201.00691v1 [cs.IR])
    (0 min) In this work we study the ranking algorithm used by F\'ed\'eration Internationale de Football Association (FIFA); we analyze the parameters it currently uses, show the formal probabilistic model from which it can be derived, and optimize the latter. In particular, analyzing the games since the introduction of the algorithm in 2018, we conclude that the game's "importance" (as defined by FIFA) used in the algorithm is counterproductive from the point of view of the predictive capability of the algorithm. We also postulate the algorithm to be rooted in the formal modelling principle, where the Davidson model proposed in 1970 seems to be an excellent candidate, preserving the form of the algorithm currently used. The results indicate that the predictive capability of the algorithm is notably improved by using the home-field advantage and the explicit model for the draws in the game. Moderate, but notable improvement may be attained by introducing the weighting of the results with the goal differential, which although not rooted in a formal modelling principle, is compatible with the current algorithm and can be tuned to the characteristics of the football competition.
    Toward the Analysis of Graph Neural Networks. (arXiv:2201.00115v1 [cs.SE])
    (0 min) Graph Neural Networks (GNNs) have recently emerged as a robust framework for graph-structured data. They have been applied to many problems such as knowledge graph analysis, social networks recommendation, and even Covid19 detection and vaccine developments. However, unlike other deep neural networks such as Feed Forward Neural Networks (FFNNs), few analyses such as verification and property inferences exist, potentially due to dynamic behaviors of GNNs, which can take arbitrary graphs as input, whereas FFNNs which only take fixed size numerical vectors as inputs. This paper proposes an approach to analyze GNNs by converting them into FFNNs and reusing existing FFNNs analyses. We discuss various designs to ensure the scalability and accuracy of the conversions. We illustrate our method on a study case of node classification. We believe that our approach opens new research directions for understanding and analyzing GNNs.
    HarmoFL: Harmonizing Local and Global Drifts in Federated Learning on Heterogeneous Medical Images. (arXiv:2112.10775v2 [eess.IV] UPDATED)
    (0 min) Multiple medical institutions collaboratively training a model using federated learning (FL) has become a promising solution for maximizing the potential of data-driven models, yet the non-independent and identically distributed (non-iid) data in medical images is still an outstanding challenge in real-world practice. The feature heterogeneity caused by diverse scanners or protocols introduces a drift in the learning process, in both local (client) and global (server) optimizations, which harms the convergence as well as model performance. Many previous works have attempted to address the non-iid issue by tackling the drift locally or globally, but how to jointly solve the two essentially coupled drifts is still unclear. In this work, we concentrate on handling both local and global drifts and introduce a new harmonizing framework called HarmoFL. First, we propose to mitigate the local update drift by normalizing amplitudes of images transformed into the frequency domain to mimic a unified imaging setting, in order to generate a harmonized feature space across local clients. Second, based on harmonized features, we design a client weight perturbation guiding each local model to reach a flat optimum, where a neighborhood area of the local optimal solution has a uniformly low loss. Without any extra communication cost, the perturbation assists the global model to optimize towards a converged optimal solution by aggregating several local flat optima. We have theoretically analyzed the proposed method and empirically conducted extensive experiments on three medical image classification and segmentation tasks, showing that HarmoFL outperforms a set of recent state-of-the-art methods with promising convergence behavior. Code is available at https://github.com/med-air/HarmoFL.
    FamilySeer: Towards Optimized Tensor Codes by Exploiting Computation Subgraph Similarity. (arXiv:2201.00194v1 [cs.LG])
    (0 min) Deploying various deep learning (DL) models efficiently has boosted the research on DL compilers. The difficulty of generating optimized tensor codes drives DL compiler to ask for the auto-tuning approaches, and the increasing demands require increasing auto-tuning efficiency and quality. Currently, the DL compilers partition the input DL models into several subgraphs and leverage the auto-tuning to find the optimal tensor codes of these subgraphs. However, existing auto-tuning approaches usually regard subgraphs as individual ones and overlook the similarities across them, and thus fail to exploit better tensor codes under limited time budgets. We propose FamilySeer, an auto-tuning framework for DL compilers that can generate better tensor codes even with limited time budgets. FamilySeer exploits the similarities and differences among subgraphs can organize them into subgraph families, where the tuning of one subgraph can also improve other subgraphs within the same family. The cost model of each family gets more purified training samples generated by the family and becomes more accurate so that the costly measurements on real hardware can be replaced with the lightweight estimation through cost model. Our experiments show that FamilySeer can generate model codes with the same code performance more efficiently than state-of-the-art auto-tuning frameworks.
    Latent Gaussian Model Boosting. (arXiv:2105.08966v3 [cs.LG] UPDATED)
    (0 min) Latent Gaussian models and boosting are widely used techniques in statistics and machine learning. Tree-boosting shows excellent prediction accuracy on many data sets, but potential drawbacks are that it assumes conditional independence of samples, produces discontinuous predictions for, e.g., spatial data, and it can have difficulty with high-cardinality categorical variables. Latent Gaussian models, such as Gaussian process and grouped random effects models, are flexible prior models which explicitly model dependence among samples and which allow for efficient learning of predictor functions and for making probabilistic predictions. However, existing latent Gaussian models usually assume either a zero or a linear prior mean function which can be an unrealistic assumption. This article introduces a novel approach that combines boosting and latent Gaussian models to remedy the above-mentioned drawbacks and to leverage the advantages of both techniques. We obtain increased prediction accuracy compared to existing approaches in both simulated and real-world data experiments.
    Uncertainty Detection in EEG Neural Decoding Models. (arXiv:2201.00627v1 [eess.SP])
    (0 min) EEG decoding systems based on deep neural networks have been widely used in decision making of brain computer interfaces (BCI). Their predictions, however, can be unreliable given the significant variance and noise in EEG signals. Previous works on EEG analysis mainly focus on the exploration of noise pattern in the source signal, while the uncertainty during the decoding process is largely unexplored. Automatically detecting and quantifying such decoding uncertainty is important for BCI motor imagery applications such as robotic arm control etc. In this work, we proposed an uncertainty estimation model (UE-EEG) to explore the uncertainty during the EEG decoding process, which considers both the uncertainty in the input signal and the uncertainty in the model. The model utilized dropout oriented method for model uncertainty estimation, and Bayesian neural network is adopted for modeling the uncertainty of input data. The model can be integrated into current widely used deep learning classifiers without change of architecture. We performed extensive experiments for uncertainty estimation in both intra-subject EEG decoding and cross-subject EEG decoding on two public motor imagery datasets, where the proposed model achieves significant improvement on the quality of estimated uncertainty and demonstrates the proposed UE-EEG is a useful tool for BCI applications.
    Deep Feature Fusion for Mitosis Counting. (arXiv:2002.03781v2 [cs.CV] UPDATED)
    (0 min) Each woman living in the United States has about 1 in 8 chance of developing invasive breast cancer. The mitotic cell count is one of the most common tests to assess the aggressiveness or grade of breast cancer. In this prognosis, histopathology images must be examined by a pathologist using high-resolution microscopes to count the cells. Unfortunately, can be an exhaustive task with poor reproducibility, especially for non-experts. Deep learning networks have recently been adapted to medical applications which are able to automatically localize these regions of interest. However, these region-based networks lack the ability to take advantage of the segmentation features produced by a full image CNN which are often used as a sole method of detection. Therefore, the proposed method leverages Faster RCNN for object detection while fusing segmentation features generated by a UNet with RGB image features to achieve an F-score of 0.508 on the MITOS-ATYPIA 2014 mitosis counting challenge dataset, outperforming state-of-the-art methods.
    Reinforcement Learning for Task Specifications with Action-Constraints. (arXiv:2201.00286v1 [cs.LG])
    (0 min) In this paper, we use concepts from supervisory control theory of discrete event systems to propose a method to learn optimal control policies for a finite-state Markov Decision Process (MDP) in which (only) certain sequences of actions are deemed unsafe (respectively safe). We assume that the set of action sequences that are deemed unsafe and/or safe are given in terms of a finite-state automaton; and propose a supervisor that disables a subset of actions at every state of the MDP so that the constraints on action sequence are satisfied. Then we present a version of the Q-learning algorithm for learning optimal policies in the presence of non-Markovian action-sequence and state constraints, where we use the development of reward machines to handle the state constraints. We illustrate the method using an example that captures the utility of automata-based methods for non-Markovian state and action specifications for reinforcement learning and show the results of simulations in this setting.
    Reconstructing spectral functions via automatic differentiation. (arXiv:2111.14760v2 [hep-ph] UPDATED)
    (0 min) Reconstructing spectral functions from Euclidean Green's functions is an important inverse problem in many-body physics. However, the inversion is proved to be ill-posed in the realistic systems with noisy Green's functions. In this Letter, we propose an automatic differentiation(AD) framework as a generic tool for the spectral reconstruction from propagator observable. Exploiting the neural networks' regularization as a non-local smoothness regulator of the spectral function, we represent spectral functions by neural networks and use the propagator's reconstruction error to optimize the network parameters unsupervisedly. In the training process, except for the positive-definite form for the spectral function, there are no other explicit physical priors embedded into the neural networks. The reconstruction performance is assessed through relative entropy and mean square error for two different network representations. Compared to the maximum entropy method, the AD framework achieves better performance in the large-noise situation. It is noted that the freedom of introducing non-local regularization is an inherent advantage of the present framework and may lead to substantial improvements in solving inverse problems.
    Self Punishment and Reward Backfill for Deep Q-Learning. (arXiv:2004.05002v2 [cs.AI] UPDATED)
    (0 min) Reinforcement learning agents learn by encouraging behaviours which maximize their total reward, usually provided by the environment. In many environments, however, the reward is provided after a series of actions rather than each single action, leading the agent to experience ambiguity in terms of whether those actions are effective, an issue known as the credit assignment problem. In this paper, we propose two strategies inspired by behavioural psychology to enable the agent to intrinsically estimate more informative reward values for actions with no reward. The first strategy, called self-punishment (SP), discourages the agent from making mistakes that lead to undesirable terminal states. The second strategy, called the rewards backfill (RB), backpropagates the rewards between two rewarded actions. We prove that, under certain assumptions and regardless of the reinforcement learning algorithm used, these two strategies maintain the order of policies in the space of all possible policies in terms of their total reward, and, by extension, maintain the optimal policy. Hence, our proposed strategies integrate with any reinforcement learning algorithm that learns a value or action-value function through experience. We incorporated these two strategies into three popular deep reinforcement learning approaches and evaluated the results on thirty Atari games. After parameter tuning, our results indicate that the proposed strategies improve the tested methods in over 65 percent of tested games by up to over 25 times performance improvement.
    Mind Your Solver! On Adversarial Attack and Defense for Combinatorial Optimization. (arXiv:2201.00402v1 [math.OC])
    (0 min) Combinatorial optimization (CO) is a long-standing challenging task not only in its inherent complexity (e.g. NP-hard) but also the possible sensitivity to input conditions. In this paper, we take an initiative on developing the mechanisms for adversarial attack and defense towards combinatorial optimization solvers, whereby the solver is treated as a black-box function and the original problem's underlying graph structure (which is often available and associated with the problem instance, e.g. DAG, TSP) is attacked under a given budget. In particular, we present a simple yet effective defense strategy to modify the graph structure to increase the robustness of solvers, which shows its universal effectiveness across tasks and solvers.
    A Review of Open-World Learning and Steps Toward Open-World Learning Without Labels. (arXiv:2011.12906v3 [cs.CV] UPDATED)
    (0 min) In open-world learning, an agent starts with a set of known classes, detects, and manages things that it does not know, and learns them over time from a non-stationary stream of data. Open-world learning is related to but also distinct from a multitude of other learning problems and this paper briefly analyzes the key differences between a wide range of problems including incremental learning, generalized novelty discovery, and generalized zero-shot learning. This paper formalizes various open-world learning problems including open-world learning without labels. These open-world problems can be addressed with modifications to known elements, we present a new framework that enables agents to combine various modules for novelty-detection, novelty-characterization, incremental learning, and instance management to learn new classes from a stream of unlabeled data in an unsupervised manner, survey how to adapt a few state-of-the-art techniques to fit the framework and use them to define seven baselines for performance on the open-world learning without labels problem. We then discuss open-world learning quality and analyze how that can improve instance management. We also discuss some of the general ambiguity issues that occur in open-world learning without labels.
    Latent structure blockmodels for Bayesian spectral graph clustering. (arXiv:2107.01734v2 [stat.ML] UPDATED)
    (0 min) Spectral embedding of network adjacency matrices often produces node representations living approximately around low-dimensional submanifold structures. In particular, hidden substructure is expected to arise when the graph is generated from a latent position model. Furthermore, the presence of communities within the network might generate community-specific submanifold structures in the embedding, but this is not explicitly accounted for in most statistical models for networks. In this article, a class of models called latent structure block models (LSBM) is proposed to address such scenarios, allowing for graph clustering when community-specific one dimensional manifold structure is present. LSBMs focus on a specific class of latent space model, the random dot product graph (RDPG), and assign a latent submanifold to the latent positions of each community. A Bayesian model for the embeddings arising from LSBMs is discussed, and shown to have a good performance on simulated and real world network data. The model is able to correctly recover the underlying communities living in a one-dimensional manifold, even when the parametric form of the underlying curves is unknown, achieving remarkable results on a variety of real data.
    Distributed Evolution Strategies Using TPUs for Meta-Learning. (arXiv:2201.00093v1 [cs.NE])
    (0 min) Meta-learning traditionally relies on backpropagation through entire tasks to iteratively improve a model's learning dynamics. However, this approach is computationally intractable when scaled to complex tasks. We propose a distributed evolutionary meta-learning strategy using Tensor Processing Units (TPUs) that is highly parallel and scalable to arbitrarily long tasks with no increase in memory cost. Using a Prototypical Network trained with evolution strategies on the Omniglot dataset, we achieved an accuracy of 98.4% on a 5-shot classification problem. Our algorithm used as much as 40 times less memory than automatic differentiation to compute the gradient, with the resulting model achieving accuracy within 1.3% of a backpropagation-trained equivalent (99.6%). We observed better classification accuracy as high as 99.1% with larger population configurations. We further experimentally validate the stability and performance of ES-ProtoNet across a variety of training conditions (varying population size, model size, number of workers, shot, way, ES hyperparameters, etc.). Our contributions are twofold: we provide the first assessment of evolutionary meta-learning in a supervised setting, and create a general framework for distributed evolution strategies on TPUs.
    Dynamic Persistent Homology for Brain Networks via Wasserstein Graph Clustering. (arXiv:2201.00087v1 [math.AT])
    (0 min) We present the novel Wasserstein graph clustering for dynamically changing graphs. The Wasserstein clustering penalizes the topological discrepancy between graphs. The Wasserstein clustering is shown to outperform the widely used k-means clustering. The method applied in more accurate determination of the state spaces of dynamically changing functional brain networks.
    AutoGCL: Automated Graph Contrastive Learning via Learnable View Generators. (arXiv:2109.10259v2 [cs.LG] UPDATED)
    (0 min) Contrastive learning has been widely applied to graph representation learning, where the view generators play a vital role in generating effective contrastive samples. Most of the existing contrastive learning methods employ pre-defined view generation methods, e.g., node drop or edge perturbation, which usually cannot adapt to input data or preserve the original semantic structures well. To address this issue, we propose a novel framework named Automated Graph Contrastive Learning (AutoGCL) in this paper. Specifically, AutoGCL employs a set of learnable graph view generators orchestrated by an auto augmentation strategy, where every graph view generator learns a probability distribution of graphs conditioned by the input. While the graph view generators in AutoGCL preserve the most representative structures of the original graph in generation of every contrastive sample, the auto augmentation learns policies to introduce adequate augmentation variances in the whole contrastive learning procedure. Furthermore, AutoGCL adopts a joint training strategy to train the learnable view generators, the graph encoder, and the classifier in an end-to-end manner, resulting in topological heterogeneity yet semantic similarity in the generation of contrastive samples. Extensive experiments on semi-supervised learning, unsupervised learning, and transfer learning demonstrate the superiority of our AutoGCL framework over the state-of-the-arts in graph contrastive learning. In addition, the visualization results further confirm that the learnable view generators can deliver more compact and semantically meaningful contrastive samples compared against the existing view generation methods.
    Learning shared neural manifolds from multi-subject FMRI data. (arXiv:2201.00622v1 [q-bio.NC])
    (0 min) Functional magnetic resonance imaging (fMRI) is a notoriously noisy measurement of brain activity because of the large variations between individuals, signals marred by environmental differences during collection, and spatiotemporal averaging required by the measurement resolution. In addition, the data is extremely high dimensional, with the space of the activity typically having much lower intrinsic dimension. In order to understand the connection between stimuli of interest and brain activity, and analyze differences and commonalities between subjects, it becomes important to learn a meaningful embedding of the data that denoises, and reveals its intrinsic structure. Specifically, we assume that while noise varies significantly between individuals, true responses to stimuli will share common, low-dimensional features between subjects which are jointly discoverable. Similar approaches have been exploited previously but they have mainly used linear methods such as PCA and shared response modeling (SRM). In contrast, we propose a neural network called MRMD-AE (manifold-regularized multiple decoder, autoencoder), that learns a common embedding from multiple subjects in an experiment while retaining the ability to decode to individual raw fMRI signals. We show that our learned common space represents an extensible manifold (where new points not seen during training can be mapped), improves the classification accuracy of stimulus features of unseen timepoints, as well as improves cross-subject translation of fMRI signals. We believe this framework can be used for many downstream applications such as guided brain-computer interface (BCI) training in the future.
    On the convex hull of convex quadratic optimization problems with indicators. (arXiv:2201.00387v1 [math.OC])
    (0 min) We consider the convex quadratic optimization problem with indicator variables and arbitrary constraints on the indicators. We show that a convex hull description of the associated mixed-integer set in an extended space with a quadratic number of additional variables consists of a single positive semidefinite constraint (explicitly stated) and linear constraints. In particular, convexification of this class of problems reduces to describing a polyhedral set in an extended formulation. We also give descriptions in the original space of variables: we provide a description based on an infinite number of conic-quadratic inequalities, which are "finitely generated." In particular, it is possible to characterize whether a given inequality is necessary to describe the convex-hull. The new theory presented here unifies several previously established results, and paves the way toward utilizing polyhedral methods to analyze the convex hull of mixed-integer nonlinear sets.
    Applications of Gaussian Mutation for Self Adaptation in Evolutionary Genetic Algorithms. (arXiv:2201.00285v1 [cs.NE])
    (0 min) In recent years, optimization problems have become increasingly more prevalent due to the need for more powerful computational methods. With the more recent advent of technology such as artificial intelligence, new metaheuristics are needed that enhance the capabilities of classical algorithms. More recently, researchers have been looking at Charles Darwin's theory of natural selection and evolution as a means of enhancing current approaches using machine learning. In 1960, the first genetic algorithm was developed by John H. Holland and his student. We explore the mathematical intuition of the genetic algorithm in developing systems capable of evolving using Gaussian mutation, as well as its implications in solving optimization problems.
    Robust Natural Language Processing: Recent Advances, Challenges, and Future Directions. (arXiv:2201.00768v1 [cs.CL])
    (0 min) Recent natural language processing (NLP) techniques have accomplished high performance on benchmark datasets, primarily due to the significant improvement in the performance of deep learning. The advances in the research community have led to great enhancements in state-of-the-art production systems for NLP tasks, such as virtual assistants, speech recognition, and sentiment analysis. However, such NLP systems still often fail when tested with adversarial attacks. The initial lack of robustness exposed troubling gaps in current models' language understanding capabilities, creating problems when NLP systems are deployed in real life. In this paper, we present a structured overview of NLP robustness research by summarizing the literature in a systemic way across various dimensions. We then take a deep-dive into the various dimensions of robustness, across techniques, metrics, embeddings, and benchmarks. Finally, we argue that robustness should be multi-dimensional, provide insights into current research, identify gaps in the literature to suggest directions worth pursuing to address these gaps.
    Statistical and Topological Properties of Sliced Probability Divergences. (arXiv:2003.05783v2 [stat.ML] UPDATED)
    (0 min) The idea of slicing divergences has been proven to be successful when comparing two probability measures in various machine learning applications including generative modeling, and consists in computing the expected value of a `base divergence' between one-dimensional random projections of the two measures. However, the topological, statistical, and computational consequences of this technique have not yet been well-established. In this paper, we aim at bridging this gap and derive various theoretical properties of sliced probability divergences. First, we show that slicing preserves the metric axioms and the weak continuity of the divergence, implying that the sliced divergence will share similar topological properties. We then precise the results in the case where the base divergence belongs to the class of integral probability metrics. On the other hand, we establish that, under mild conditions, the sample complexity of a sliced divergence does not depend on the problem dimension. We finally apply our general results to several base divergences, and illustrate our theory on both synthetic and real data experiments.
    Elsa: Energy-based learning for semi-supervised anomaly detection. (arXiv:2103.15296v2 [cs.CV] UPDATED)
    (0 min) Anomaly detection aims at identifying deviant instances from the normal data distribution. Many advances have been made in the field, including the innovative use of unsupervised contrastive learning. However, existing methods generally assume clean training data and are limited when the data contain unknown anomalies. This paper presents Elsa, a novel semi-supervised anomaly detection approach that unifies the concept of energy-based models with unsupervised contrastive learning. Elsa instills robustness against any data contamination by a carefully designed fine-tuning step based on the new energy function that forces the normal data to be divided into classes of prototypes. Experiments on multiple contamination scenarios show the proposed model achieves SOTA performance. Extensive analyses also verify the contribution of each component in the proposed model. Beyond the experiments, we also offer a theoretical interpretation of why contrastive learning alone cannot detect anomalies under data contamination.
    Risk Bounds for Over-parameterized Maximum Margin Classification on Sub-Gaussian Mixtures. (arXiv:2104.13628v2 [cs.LG] UPDATED)
    (0 min) Modern machine learning systems such as deep neural networks are often highly over-parameterized so that they can fit the noisy training data exactly, yet they can still achieve small test errors in practice. In this paper, we study this "benign overfitting" phenomenon of the maximum margin classifier for linear classification problems. Specifically, we consider data generated from sub-Gaussian mixtures, and provide a tight risk bound for the maximum margin linear classifier in the over-parameterized setting. Our results precisely characterize the condition under which benign overfitting can occur in linear classification problems, and improve on previous work. They also have direct implications for over-parameterized logistic regression.
    Topic Analysis of Superconductivity Literature by Semantic Non-negative Matrix Factorization. (arXiv:2201.00687v1 [cs.DL])
    (0 min) We utilize a recently developed topic modeling method called SeNMFk, extending the standard Non-negative Matrix Factorization (NMF) methods by incorporating the semantic structure of the text, and adding a robust system for determining the number of topics. With SeNMFk, we were able to extract coherent topics validated by human experts. From these topics, a few are relatively general and cover broad concepts, while the majority can be precisely mapped to specific scientific effects or measurement techniques. The topics also differ by ubiquity, with only three topics prevalent in almost 40 percent of the abstract, while each specific topic tends to dominate a small subset of the abstracts. These results demonstrate the ability of SeNMFk to produce a layered and nuanced analysis of large scientific corpora.
    Using Non-Stationary Bandits for Learning in Repeated Cournot Games with Non-Stationary Demand. (arXiv:2201.00486v1 [cs.LG])
    (0 min) Many past attempts at modeling repeated Cournot games assume that demand is stationary. This does not align with real-world scenarios in which market demands can evolve over a product's lifetime for a myriad of reasons. In this paper, we model repeated Cournot games with non-stationary demand such that firms/agents face separate instances of non-stationary multi-armed bandit problem. The set of arms/actions that an agent can choose from represents discrete production quantities; here, the action space is ordered. Agents are independent and autonomous, and cannot observe anything from the environment; they can only see their own rewards after taking an action, and only work towards maximizing these rewards. We propose a novel algorithm 'Adaptive with Weighted Exploration (AWE) $\epsilon$-greedy' which is remotely based on the well-known $\epsilon$-greedy approach. This algorithm detects and quantifies changes in rewards due to varying market demand and varies learning rate and exploration rate in proportion to the degree of changes in demand, thus enabling agents to better identify new optimal actions. For efficient exploration, it also deploys a mechanism for weighing actions that takes advantage of the ordered action space. We use simulations to study the emergence of various equilibria in the market. In addition, we study the scalability of our approach in terms number of total agents in the system and the size of action space. We consider both symmetric and asymmetric firms in our models. We found that using our proposed method, agents are able to swiftly change their course of action according to the changes in demand, and they also engage in collusive behavior in many simulations.
    Fair Data Representation for Machine Learning at the Pareto Frontier. (arXiv:2201.00292v1 [stat.ML])
    (0 min) As machine learning powered decision making is playing an increasingly important role in our daily lives, it is imperative to strive for fairness of the underlying data processing and algorithms. We propose a pre-processing algorithm for fair data representation via which L2- objective supervised learning algorithms result in an estimation of the Pareto frontier between prediction error and statistical disparity. In particular, the present work applies the optimal positive definite affine transport maps to approach the post-processing Wasserstein barycenter characterization of the optimal fair L2-objective supervised learning via a pre-processing data deformation. We call the resulting data Wasserstein pseudo-barycenter. Furthermore, we show that the Wasserstein geodesics from the learning outcome marginals to the barycenter characterizes the Pareto frontier between L2-loss and total Wasserstein distance among learning outcome marginals. Thereby, an application of McCann interpolation generalizes the pseudo-barycenter to a family of data representations via which L2-objective supervised learning algorithms result in the Pareto frontier. Numerical simulations underscore the advantages of the proposed data representation: (1) the pre-processing step is compositive with arbitrary L2-objective supervised learning methods and unseen data; (2) the fair representation protects data privacy by preventing the training machine from direct or indirect access to the sensitive information of the data; (3) the optimal affine map results in efficient computation of fair supervised learning on high-dimensional data; (4) experimental results shed light on the fairness of L2-objective unsupervised learning via the proposed fair data representation.
    Novelty-based Generalization Evaluation for Traffic Light Detection. (arXiv:2201.00531v1 [cs.CV])
    (0 min) The advent of Convolutional Neural Networks (CNNs) has led to their application in several domains. One noteworthy application is the perception system for autonomous driving that relies on the predictions from CNNs. Practitioners evaluate the generalization ability of such CNNs by calculating various metrics on an independent test dataset. A test dataset is often chosen based on only one precondition, i.e., its elements are not a part of the training data. Such a dataset may contain objects that are both similar and novel w.r.t. the training dataset. Nevertheless, existing works do not reckon the novelty of the test samples and treat them all equally for evaluating generalization. Such novelty-based evaluations are of significance to validate the fitness of a CNN in autonomous driving applications. Hence, we propose a CNN generalization scoring framework that considers novelty of objects in the test dataset. We begin with the representation learning technique to reduce the image data into a low-dimensional space. It is on this space we estimate the novelty of the test samples. Finally, we calculate the generalization score as a combination of the test data prediction performance and novelty. We perform an experimental study of the same for our traffic light detection application. In addition, we systematically visualize the results for an interpretable notion of novelty.
    Rethinking Feature Uncertainty in Stochastic Neural Networks for Adversarial Robustness. (arXiv:2201.00148v1 [cs.LG])
    (2 min) It is well-known that deep neural networks (DNNs) have shown remarkable success in many fields. However, when adding an imperceptible magnitude perturbation on the model input, the model performance might get rapid decrease. To address this issue, a randomness technique has been proposed recently, named Stochastic Neural Networks (SNNs). Specifically, SNNs inject randomness into the model to defend against unseen attacks and improve the adversarial robustness. However, existed studies on SNNs mainly focus on injecting fixed or learnable noises to model weights/activations. In this paper, we find that the existed SNNs performances are largely bottlenecked by the feature representation ability. Surprisingly, simply maximizing the variance per dimension of the feature distribution leads to a considerable boost beyond all previous methods, which we named maximize feature distribution variance stochastic neural network (MFDV-SNN). Extensive experiments on well-known white- and black-box attacks show that MFDV-SNN achieves a significant improvement over existing methods, which indicates that it is a simple but effective method to improve model robustness.
    Stochastic Weight Averaging Revisited. (arXiv:2201.00519v1 [cs.LG])
    (0 min) Stochastic weight averaging (SWA) is recognized as a simple while one effective approach to improve the generalization of stochastic gradient descent (SGD) for training deep neural networks (DNNs). A common insight to explain its success is that averaging weights following an SGD process equipped with cyclical or high constant learning rates can discover wider optima, which then lead to better generalization. We give a new insight that does not concur with the above one. We characterize that SWA's performance is highly dependent on to what extent the SGD process that runs before SWA converges, and the operation of weight averaging only contributes to variance reduction. This new insight suggests practical guides on better algorithm design. As an instantiation, we show that following an SGD process with insufficient convergence, running SWA more times leads to continual incremental benefits in terms of generalization. Our findings are corroborated by extensive experiments across different network architectures, including a baseline CNN, PreResNet-164, WideResNet-28-10, VGG16, ResNet-50, ResNet-152, DenseNet-161, and different datasets including CIFAR-{10,100}, and Imagenet.
    Minimum Excess Risk in Bayesian Learning. (arXiv:2012.14868v2 [cs.LG] UPDATED)
    (0 min) We analyze the best achievable performance of Bayesian learning under generative models by defining and upper-bounding the minimum excess risk (MER): the gap between the minimum expected loss attainable by learning from data and the minimum expected loss that could be achieved if the model realization were known. The definition of MER provides a principled way to define different notions of uncertainties in Bayesian learning, including the aleatoric uncertainty and the minimum epistemic uncertainty. Two methods for deriving upper bounds for the MER are presented. The first method, generally suitable for Bayesian learning with a parametric generative model, upper-bounds the MER by the conditional mutual information between the model parameters and the quantity being predicted given the observed data. It allows us to quantify the rate at which the MER decays to zero as more data becomes available. Under realizable models, this method also relates the MER to the richness of the generative function class, notably the VC dimension in binary classification. The second method, particularly suitable for Bayesian learning with a parametric predictive model, relates the MER to the minimum estimation error of the model parameters from data. It explicitly shows how the uncertainty in model parameter estimation translates to the MER and to the final prediction uncertainty. We also extend the definition and analysis of MER to the setting with multiple model families and the setting with nonparametric models. Along the discussions we draw some comparisons between the MER in Bayesian learning and the excess risk in frequentist learning.
    A Mixed Integer Programming Approach to Training Dense Neural Networks. (arXiv:2201.00723v1 [cs.LG])
    (0 min) Artificial Neural Networks (ANNs) are prevalent machine learning models that have been applied across various real world classification tasks. ANNs require a large amount of data to have strong out of sample performance, and many algorithms for training ANN parameters are based on stochastic gradient descent (SGD). However, the SGD ANNs that tend to perform best on prediction tasks are trained in an end to end manner that requires a large number of model parameters and random initialization. This means training ANNs is very time consuming and the resulting models take a lot of memory to deploy. In order to train more parsimonious ANN models, we propose the use of alternative methods from the constrained optimization literature for ANN training and pretraining. In particular, we propose novel mixed integer programming (MIP) formulations for training fully-connected ANNs. Our formulations can account for both binary activation and rectified linear unit (ReLU) activation ANNs, and for the use of a log likelihood loss. We also develop a layer-wise greedy approach, a technique adapted for reducing the number of layers in the ANN, for model pretraining using our MIP formulations. We then present numerical experiments comparing our MIP based methods against existing SGD based approaches and show that we are able to achieve models with competitive out of sample performance that are significantly more parsimonious.
    On Sensitivity of Deep Learning Based Text Classification Algorithms to Practical Input Perturbations. (arXiv:2201.00318v1 [cs.CL])
    (0 min) Text classification is a fundamental Natural Language Processing task that has a wide variety of applications, where deep learning approaches have produced state-of-the-art results. While these models have been heavily criticized for their black-box nature, their robustness to slight perturbations in input text has been a matter of concern. In this work, we carry out a data-focused study evaluating the impact of systematic practical perturbations on the performance of the deep learning based text classification models like CNN, LSTM, and BERT-based algorithms. The perturbations are induced by the addition and removal of unwanted tokens like punctuation and stop-words that are minimally associated with the final performance of the model. We show that these deep learning approaches including BERT are sensitive to such legitimate input perturbations on four standard benchmark datasets SST2, TREC-6, BBC News, and tweet_eval. We observe that BERT is more susceptible to the removal of tokens as compared to the addition of tokens. Moreover, LSTM is slightly more sensitive to input perturbations as compared to CNN based model. The work also serves as a practical guide to assessing the impact of discrepancies in train-test conditions on the final performance of models.
    SAFL: A Self-Attention Scene Text Recognizer with Focal Loss. (arXiv:2201.00132v1 [cs.CV])
    (2 min) In the last decades, scene text recognition has gained worldwide attention from both the academic community and actual users due to its importance in a wide range of applications. Despite achievements in optical character recognition, scene text recognition remains challenging due to inherent problems such as distortions or irregular layout. Most of the existing approaches mainly leverage recurrence or convolution-based neural networks. However, while recurrent neural networks (RNNs) usually suffer from slow training speed due to sequential computation and encounter problems as vanishing gradient or bottleneck, CNN endures a trade-off between complexity and performance. In this paper, we introduce SAFL, a self-attention-based neural network model with the focal loss for scene text recognition, to overcome the limitation of the existing approaches. The use of focal loss instead of negative log-likelihood helps the model focus more on low-frequency samples training. Moreover, to deal with the distortions and irregular texts, we exploit Spatial TransformerNetwork (STN) to rectify text before passing to the recognition network. We perform experiments to compare the performance of the proposed model with seven benchmarks. The numerical results show that our model achieves the best performance.
    Support vector machines and Radon's theorem. (arXiv:2011.00617v2 [cs.LG] UPDATED)
    (2 min) A support vector machine (SVM) is an algorithm which finds a hyperplane that optimally separates labeled data points in $\mathbb{R}^n$ into positive and negative classes. The data points on the margin of this separating hyperplane are called support vectors. We connect the possible configurations of support vectors to Radon's theorem, which provides guarantees for when a set of points can be divided into two classes (positive and negative) whose convex hulls intersect. If the convex hulls of the positive and negative support vectors are projected onto a separating hyperplane, the projections intersect in at least one point if and only if the hyperplane is optimal. Further, with a particular type of general position, we show that (a) the projected convex hulls of the support vectors intersect in exactly one point, (b) the support vectors are stable under perturbation, (c) there are at most $n+1$ support vectors, and (d) every number of support vectors from 2 up to $n+1$ is possible. Finally, we perform computer simulations studying the expected number of support vectors, and their configurations, for randomly generated data. We observe that as the distance between classes of points increases for this type of randomly generated data, configurations with two support vectors become the most likely configurations.
    Continuity of Generalized Entropy and Statistical Learning. (arXiv:2012.15829v2 [cs.LG] UPDATED)
    (2 min) We study the continuity property of the generalized entropy as a function of the underlying probability distribution, defined with an action space and a loss function, and use this property to answer the basic questions in statistical learning theory: the excess risk analyses for various learning methods. We first derive upper and lower bounds for the entropy difference of two distributions in terms of several commonly used f-divergences, the Wasserstein distance, a distance that depends on the action space and the loss function, and the Bregman divergence generated by the entropy, which also induces bounds in terms of the Euclidean distance between the two distributions. Examples are given along with the discussion of each general result, comparisons are made with the existing entropy difference bounds, and new mutual information upper bounds are derived based on the new results. We then apply the entropy difference bounds to the theory of statistical learning. It is shown that the excess risks in the two popular learning paradigms, the frequentist learning and the Bayesian learning, both can be studied with the continuity property of different forms of the generalized entropy. The analysis is then extended to the continuity of generalized conditional entropy. The extension provides performance bounds for Bayes decision making with mismatched distributions. It also leads to excess risk bounds for a third paradigm of learning, where the decision rule is optimally designed under the projection of the empirical distribution to a predefined family of distributions. We thus establish a unified method of excess risk analysis for the three major paradigms of statistical learning, through the continuity of generalized entropy.
    Concept Embeddings for Fuzzy Logic Verification of Deep Neural Networks in Perception Tasks. (arXiv:2201.00572v1 [cs.CV])
    (2 min) One major drawback of deep neural networks (DNNs) for use in sensitive application domains is their black-box nature. This makes it hard to verify or monitor complex, symbolic requirements. In this work, we present a simple, yet effective, approach to verify whether a trained convolutional neural network (CNN) respects specified symbolic background knowledge. The knowledge may consist of any fuzzy predicate logic rules. For this, we utilize methods from explainable artificial intelligence (XAI): First, using concept embedding analysis, the output of a computer vision CNN is post-hoc enriched by concept outputs; second, logical rules from prior knowledge are fuzzified to serve as continuous-valued functions on the concept outputs. These can be evaluated with little computational overhead. We demonstrate three diverse use-cases of our method on stateof-the-art object detectors: Finding corner cases, utilizing the rules for detecting and localizing DNN misbehavior during runtime, and comparing the logical consistency of DNNs. The latter is used to find related differences between EfficientDet D1 and Mask R-CNN object detectors. We show that this approach benefits from fuzziness and calibrating the concept outputs.
    Toward Causal-Aware RL: State-Wise Action-Refined Temporal Difference. (arXiv:2201.00354v1 [cs.LG])
    (2 min) Although it is well known that exploration plays a key role in Reinforcement Learning (RL), prevailing exploration strategies for continuous control tasks in RL are mainly based on naive isotropic Gaussian noise regardless of the causality relationship between action space and the task and consider all dimensions of actions equally important. In this work, we propose to conduct interventions on the primal action space to discover the causal relationship between the action space and the task reward. We propose the method of State-Wise Action Refined (SWAR), which addresses the issue of action space redundancy and promote causality discovery in RL. We formulate causality discovery in RL tasks as a state-dependent action space selection problem and propose two practical algorithms as solutions. The first approach, TD-SWAR, detects task-related actions during temporal difference learning, while the second approach, Dyn-SWAR, reveals important actions through dynamic model prediction. Empirically, both methods provide approaches to understand the decisions made by RL agents and improve learning efficiency in action-redundant tasks.
    The complementarity of a diverse range of deep learning features extracted from video content for video recommendation. (arXiv:2011.10834v2 [cs.IR] UPDATED)
    (2 min) Following the popularisation of media streaming, a number of video streaming services are continuously buying new video content to mine the potential profit from them. As such, the newly added content has to be handled well to be recommended to suitable users. In this paper, we address the new item cold-start problem by exploring the potential of various deep learning features to provide video recommendations. The deep learning features investigated include features that capture the visual-appearance, audio and motion information from video content. We also explore different fusion methods to evaluate how well these feature modalities can be combined to fully exploit the complementary information captured by them. Experiments on a real-world video dataset for movie recommendations show that deep learning features outperform hand-crafted features. In particular, recommendations generated with deep learning audio features and action-centric deep learning features are superior to MFCC and state-of-the-art iDT features. In addition, the combination of various deep learning features with hand-crafted features and textual metadata yields significant improvement in recommendations compared to combining only the former.
    Operator Deep Q-Learning: Zero-Shot Reward Transferring in Reinforcement Learning. (arXiv:2201.00236v1 [cs.LG])
    (2 min) Reinforcement learning (RL) has drawn increasing interests in recent years due to its tremendous success in various applications. However, standard RL algorithms can only be applied for single reward function, and cannot adapt to an unseen reward function quickly. In this paper, we advocate a general operator view of reinforcement learning, which enables us to directly approximate the operator that maps from reward function to value function. The benefit of learning the operator is that we can incorporate any new reward function as input and attain its corresponding value function in a zero-shot manner. To approximate this special type of operator, we design a number of novel operator neural network architectures based on its theoretical properties. Our design of operator networks outperform the existing methods and the standard design of general purpose operator network, and we demonstrate the benefit of our operator deep Q-learning framework in several tasks including reward transferring for offline policy evaluation (OPE) and reward transferring for offline policy optimization in a range of tasks.
    Multi-Dimensional Model Compression of Vision Transformer. (arXiv:2201.00043v1 [cs.CV])
    (2 min) Vision transformers (ViT) have recently attracted considerable attentions, but the huge computational cost remains an issue for practical deployment. Previous ViT pruning methods tend to prune the model along one dimension solely, which may suffer from excessive reduction and lead to sub-optimal model quality. In contrast, we advocate a multi-dimensional ViT compression paradigm, and propose to harness the redundancy reduction from attention head, neuron and sequence dimensions jointly. We firstly propose a statistical dependence based pruning criterion that is generalizable to different dimensions for identifying deleterious components. Moreover, we cast the multi-dimensional compression as an optimization, learning the optimal pruning policy across the three dimensions that maximizes the compressed model's accuracy under a computational budget. The problem is solved by our adapted Gaussian process search with expected improvement. Experimental results show that our method effectively reduces the computational cost of various ViT models. For example, our method reduces 40\% FLOPs without top-1 accuracy loss for DeiT and T2T-ViT models, outperforming previous state-of-the-arts.
    Application of Machine Learning Methods in Inferring Surface Water Groundwater Exchanges using High Temporal Resolution Temperature Measurements. (arXiv:2201.00726v1 [cs.LG])
    (2 min) We examine the ability of machine learning (ML) and deep learning (DL) algorithms to infer surface/ground exchange flux based on subsurface temperature observations. The observations and fluxes are produced from a high-resolution numerical model representing conditions in the Columbia River near the Department of Energy Hanford site located in southeastern Washington State. Random measurement error, of varying magnitude, is added to the synthetic temperature observations. The results indicate that both ML and DL methods can be used to infer the surface/ground exchange flux. DL methods, especially convolutional neural networks, outperform the ML methods when used to interpret noisy temperature data with a smoothing filter applied. However, the ML methods also performed well and they are can better identify a reduced number of important observations, which could be useful for measurement network optimization. Surprisingly, the ML and DL methods better inferred upward flux than downward flux. This is in direct contrast to previous findings using numerical models to infer flux from temperature observations and it may suggest that combined use of ML or DL inference with numerical inference could improve flux estimation beneath river systems.
    'Moving On' -- Investigating Inventors' Ethnic Origins Using Supervised Learning. (arXiv:2201.00578v1 [econ.GN])
    (2 min) Patent data provides rich information about technical inventions, but does not disclose the ethnic origin of inventors. In this paper, I use supervised learning techniques to infer this information. To do so, I construct a dataset of 95'202 labeled names and train an artificial recurrent neural network with long-short-term memory (LSTM) to predict ethnic origins based on names. The trained network achieves an overall performance of 91% across 17 ethnic origins. I use this model to classify and investigate the ethnic origins of 2.68 million inventors and provide novel descriptive evidence regarding their ethnic origin composition over time and across countries and technological fields. The global ethnic origin composition has become more diverse over the last decades, which was mostly due to a relative increase of Asian origin inventors. Furthermore, the prevalence of foreign-origin inventors is especially high in the USA, but has also increased in other high-income economies. This increase was mainly driven by an inflow of non-western inventors into emerging high-technology fields for the USA, but not for other high-income countries.
    Semi-supervised Stance Detection of Tweets Via Distant Network Supervision. (arXiv:2201.00614v1 [cs.SI])
    (2 min) Detecting and labeling stance in social media text is strongly motivated by hate speech detection, poll prediction, engagement forecasting, and concerted propaganda detection. Today's best neural stance detectors need large volumes of training data, which is difficult to curate given the fast-changing landscape of social media text and issues on which users opine. Homophily properties over the social network provide strong signal of coarse-grained user-level stance. But semi-supervised approaches for tweet-level stance detection fail to properly leverage homophily. In light of this, We present SANDS, a new semi-supervised stance detector. SANDS starts from very few labeled tweets. It builds multiple deep feature views of tweets. It also uses a distant supervision signal from the social network to provide a surrogate loss signal to the component learners. We prepare two new tweet datasets comprising over 236,000 politically tinted tweets from two demographics (US and India) posted by over 87,000 users, their follower-followee graph, and over 8,000 tweets annotated by linguists. SANDS achieves a macro-F1 score of 0.55 (0.49) on US (India)-based datasets, outperforming 17 baselines (including variants of SANDS) substantially, particularly for minority stance labels and noisy text. Numerous ablation experiments on SANDS disentangle the dynamics of textual and network-propagated stance signals.
    Adaptive Memory Networks with Self-supervised Learning for Unsupervised Anomaly Detection. (arXiv:2201.00464v1 [cs.LG])
    (2 min) Unsupervised anomaly detection aims to build models to effectively detect unseen anomalies by only training on the normal data. Although previous reconstruction-based methods have made fruitful progress, their generalization ability is limited due to two critical challenges. First, the training dataset only contains normal patterns, which limits the model generalization ability. Second, the feature representations learned by existing models often lack representativeness which hampers the ability to preserve the diversity of normal patterns. In this paper, we propose a novel approach called Adaptive Memory Network with Self-supervised Learning (AMSL) to address these challenges and enhance the generalization ability in unsupervised anomaly detection. Based on the convolutional autoencoder structure, AMSL incorporates a self-supervised learning module to learn general normal patterns and an adaptive memory fusion module to learn rich feature representations. Experiments on four public multivariate time series datasets demonstrate that AMSL significantly improves the performance compared to other state-of-the-art methods. Specifically, on the largest CAP sleep stage detection dataset with 900 million samples, AMSL outperforms the second-best baseline by \textbf{4}\%+ in both accuracy and F1 score. Apart from the enhanced generalization ability, AMSL is also more robust against input noise.
    Exploiting Bi-directional Global Transition Patterns and Personal Preferences for Missing POI Category Identification. (arXiv:2201.00014v1 [cs.LG])
    (2 min) Recent years have witnessed the increasing popularity of Location-based Social Network (LBSN) services, which provides unparalleled opportunities to build personalized Point-of-Interest (POI) recommender systems. Existing POI recommendation and location prediction tasks utilize past information for future recommendation or prediction from a single direction perspective, while the missing POI category identification task needs to utilize the check-in information both before and after the missing category. Therefore, a long-standing challenge is how to effectively identify the missing POI categories at any time in the real-world check-in data of mobile users. To this end, in this paper, we propose a novel neural network approach to identify the missing POI categories by integrating both bi-directional global non-personal transition patterns and personal preferences of users. Specifically, we delicately design an attention matching cell to model how well the check-in category information matches their non-personal transition patterns and personal preferences. Finally, we evaluate our model on two real-world datasets, which clearly validate its effectiveness compared with the state-of-the-art baselines. Furthermore, our model can be naturally extended to address next POI category recommendation and prediction tasks with competitive performance.
    SAE: Sequential Anchored Ensembles. (arXiv:2201.00649v1 [cs.LG])
    (2 min) Computing the Bayesian posterior of a neural network is a challenging task due to the high-dimensionality of the parameter space. Anchored ensembles approximate the posterior by training an ensemble of neural networks on anchored losses designed for the optima to follow the Bayesian posterior. Training an ensemble, however, becomes computationally expensive as its number of members grows since the full training procedure is repeated for each member. In this note, we present Sequential Anchored Ensembles (SAE), a lightweight alternative to anchored ensembles. Instead of training each member of the ensemble from scratch, the members are trained sequentially on losses sampled with high auto-correlation, hence enabling fast convergence of the neural networks and efficient approximation of the Bayesian posterior. SAE outperform anchored ensembles, for a given computational budget, on some benchmarks while showing comparable performance on the others and achieved 2nd and 3rd place in the light and extended tracks of the NeurIPS 2021 Approximate Inference in Bayesian Deep Learning competition.
    Matrix Decomposition and Applications. (arXiv:2201.00145v1 [math.NA])
    (2 min) In 1954, Alston S. Householder published Principles of Numerical Analysis, one of the first modern treatments on matrix decomposition that favored a (block) LU decomposition-the factorization of a matrix into the product of lower and upper triangular matrices. And now, matrix decomposition has become a core technology in machine learning, largely due to the development of the back propagation algorithm in fitting a neural network. The sole aim of this survey is to give a self-contained introduction to concepts and mathematical tools in numerical linear algebra and matrix analysis in order to seamlessly introduce matrix decomposition techniques and their applications in subsequent sections. However, we clearly realize our inability to cover all the useful and interesting results concerning matrix decomposition and given the paucity of scope to present this discussion, e.g., the separated analysis of the Euclidean space, Hermitian space, Hilbert space, and things in the complex domain. We refer the reader to literature in the field of linear algebra for a more detailed introduction to the related fields.
    Recover the spectrum of covariance matrix: a non-asymptotic iterative method. (arXiv:2201.00230v1 [stat.ML])
    (0 min) It is well known the sample covariance has a consistent bias in the spectrum, for example spectrum of Wishart matrix follows the Marchenko-Pastur law. We in this work introduce an iterative algorithm 'Concent' that actively eliminate this bias and recover the true spectrum for small and moderate dimensions.
    Artificial Intelligence and Statistical Techniques in Short-Term Load Forecasting: A Review. (arXiv:2201.00437v1 [cs.LG])
    (2 min) Electrical utilities depend on short-term demand forecasting to proactively adjust production and distribution in anticipation of major variations. This systematic review analyzes 240 works published in scholarly journals between 2000 and 2019 that focus on applying Artificial Intelligence (AI), statistical, and hybrid models to short-term load forecasting (STLF). This work represents the most comprehensive review of works on this subject to date. A complete analysis of the literature is conducted to identify the most popular and accurate techniques as well as existing gaps. The findings show that although Artificial Neural Networks (ANN) continue to be the most commonly used standalone technique, researchers have been exceedingly opting for hybrid combinations of different techniques to leverage the combined advantages of individual methods. The review demonstrates that it is commonly possible with these hybrid combinations to achieve prediction accuracy exceeding 99%. The most successful duration for short-term forecasting has been identified as prediction for a duration of one day at an hourly interval. The review has identified a deficiency in access to datasets needed for training of the models. A significant gap has been identified in researching regions other than Asia, Europe, North America, and Australia.
    Asymptotic Convergence of Deep Multi-Agent Actor-Critic Algorithms. (arXiv:2201.00570v1 [cs.LG])
    (2 min) We present sufficient conditions that ensure convergence of the multi-agent Deep Deterministic Policy Gradient (DDPG) algorithm. It is an example of one of the most popular paradigms of Deep Reinforcement Learning (DeepRL) for tackling continuous action spaces: the actor-critic paradigm. In the setting considered herein, each agent observes a part of the global state space in order to take local actions, for which it receives local rewards. For every agent, DDPG trains a local actor (policy) and a local critic (Q-function). The analysis shows that multi-agent DDPG using neural networks to approximate the local policies and critics converge to limits with the following properties: The critic limits minimize the average squared Bellman loss; the actor limits parameterize a policy that maximizes the local critic's approximation of $Q_i^*$, where $i$ is the agent index. The averaging is with respect to a probability distribution over the global state-action space. It captures the asymptotics of all local training processes. Finally, we extend the analysis to a fully decentralized setting where agents communicate over a wireless network prone to delays and losses; a typical scenario in, e.g., robotic applications.
    KerGNNs: Interpretable Graph Neural Networks with Graph Kernels. (arXiv:2201.00491v1 [cs.LG])
    (2 min) Graph kernels are historically the most widely-used technique for graph classification tasks. However, these methods suffer from limited performance because of the hand-crafted combinatorial features of graphs. In recent years, graph neural networks (GNNs) have become the state-of-the-art method in downstream graph-related tasks due to their superior performance. Most GNNs are based on Message Passing Neural Network (MPNN) frameworks. However, recent studies show that MPNNs can not exceed the power of the Weisfeiler-Lehman (WL) algorithm in graph isomorphism test. To address the limitations of existing graph kernel and GNN methods, in this paper, we propose a novel GNN framework, termed \textit{Kernel Graph Neural Networks} (KerGNNs), which integrates graph kernels into the message passing process of GNNs. Inspired by convolution filters in convolutional neural networks (CNNs), KerGNNs adopt trainable hidden graphs as graph filters which are combined with subgraphs to update node embeddings using graph kernels. In addition, we show that MPNNs can be viewed as special cases of KerGNNs. We apply KerGNNs to multiple graph-related tasks and use cross-validation to make fair comparisons with benchmarks. We show that our method achieves competitive performance compared with existing state-of-the-art methods, demonstrating the potential to increase the representation ability of GNNs. We also show that the trained graph filters in KerGNNs can reveal the local graph structures of the dataset, which significantly improves the model interpretability compared with conventional GNN models.
    Avoiding Catastrophe: Active Dendrites Enable Multi-Task Learning in Dynamic Environments. (arXiv:2201.00042v1 [cs.NE])
    (2 min) A key challenge for AI is to build embodied systems that operate in dynamically changing environments. Such systems must adapt to changing task contexts and learn continuously. Although standard deep learning systems achieve state of the art results on static benchmarks, they often struggle in dynamic scenarios. In these settings, error signals from multiple contexts can interfere with one another, ultimately leading to a phenomenon known as catastrophic forgetting. In this article we investigate biologically inspired architectures as solutions to these problems. Specifically, we show that the biophysical properties of dendrites and local inhibitory systems enable networks to dynamically restrict and route information in a context-specific manner. Our key contributions are as follows. First, we propose a novel artificial neural network architecture that incorporates active dendrites and sparse representations into the standard deep learning framework. Next, we study the performance of this architecture on two separate benchmarks requiring task-based adaptation: Meta-World, a multi-task reinforcement learning environment where a robotic agent must learn to solve a variety of manipulation tasks simultaneously; and a continual learning benchmark in which the model's prediction task changes throughout training. Analysis on both benchmarks demonstrates the emergence of overlapping but distinct and sparse subnetworks, allowing the system to fluidly learn multiple tasks with minimal forgetting. Our neural implementation marks the first time a single architecture has achieved competitive results on both multi-task and continual learning settings. Our research sheds light on how biological properties of neurons can inform deep learning systems to address dynamic scenarios that are typically impossible for traditional ANNs to solve.
    Role of Data Augmentation Strategies in Knowledge Distillation for Wearable Sensor Data. (arXiv:2201.00111v1 [cs.LG])
    (2 min) Deep neural networks are parametrized by several thousands or millions of parameters, and have shown tremendous success in many classification problems. However, the large number of parameters makes it difficult to integrate these models into edge devices such as smartphones and wearable devices. To address this problem, knowledge distillation (KD) has been widely employed, that uses a pre-trained high capacity network to train a much smaller network, suitable for edge devices. In this paper, for the first time, we study the applicability and challenges of using KD for time-series data for wearable devices. Successful application of KD requires specific choices of data augmentation methods during training. However, it is not yet known if there exists a coherent strategy for choosing an augmentation approach during KD. In this paper, we report the results of a detailed study that compares and contrasts various common choices and some hybrid data augmentation strategies in KD based human activity analysis. Research in this area is often limited as there are not many comprehensive databases available in the public domain from wearable devices. Our study considers databases from small scale publicly available to one derived from a large scale interventional study into human activity and sedentary behavior. We find that the choice of data augmentation techniques during KD have a variable level of impact on end performance, and find that the optimal network choice as well as data augmentation strategies are specific to a dataset at hand. However, we also conclude with a general set of recommendations that can provide a strong baseline performance across databases.
    Graph Signal Reconstruction Techniques for IoT Air Pollution Monitoring Platforms. (arXiv:2201.00378v1 [eess.SP])
    (2 min) Air pollution monitoring platforms play a very important role in preventing and mitigating the effects of pollution. Recent advances in the field of graph signal processing have made it possible to describe and analyze air pollution monitoring networks using graphs. One of the main applications is the reconstruction of the measured signal in a graph using a subset of sensors. Reconstructing the signal using information from sensor neighbors can help improve the quality of network data, examples are filling in missing data with correlated neighboring nodes, or correcting a drifting sensor with neighboring sensors that are more accurate. This paper compares the use of various types of graph signal reconstruction methods applied to real data sets of Spanish air pollution reference stations. The methods considered are Laplacian interpolation, graph signal processing low-pass based graph signal reconstruction, and kernel-based graph signal reconstruction, and are compared on actual air pollution data sets measuring O3, NO2, and PM10. The ability of the methods to reconstruct the signal of a pollutant is shown, as well as the computational cost of this reconstruction. The results indicate the superiority of methods based on kernel-based graph signal reconstruction, as well as the difficulties of the methods to scale in an air pollution monitoring network with a large number of low-cost sensors. However, we show that scalability can be overcome with simple methods, such as partitioning the network using a clustering algorithm.
    An Efficient Federated Distillation Learning System for Multi-task Time Series Classification. (arXiv:2201.00011v1 [cs.LG])
    (2 min) This paper proposes an efficient federated distillation learning system (EFDLS) for multi-task time series classification (TSC). EFDLS consists of a central server and multiple mobile users, where different users may run different TSC tasks. EFDLS has two novel components, namely a feature-based student-teacher (FBST) framework and a distance-based weights matching (DBWM) scheme. Within each user, the FBST framework transfers knowledge from its teacher's hidden layers to its student's hidden layers via knowledge distillation, with the teacher and student having identical network structure. For each connected user, its student model's hidden layers' weights are uploaded to the EFDLS server periodically. The DBWM scheme is deployed on the server, with the least square distance used to measure the similarity between the weights of two given models. This scheme finds a partner for each connected user such that the user's and its partner's weights are the closest among all the weights uploaded. The server exchanges and sends back the user's and its partner's weights to these two users which then load the received weights to their teachers' hidden layers. Experimental results show that the proposed EFDLS achieves excellent performance on a set of selected UCR2018 datasets regarding top-1 accuracy.
    Semi-Supervised Graph Attention Networks for Event Representation Learning. (arXiv:2201.00363v1 [cs.LG])
    (2 min) Event analysis from news and social networks is very useful for a wide range of social studies and real-world applications. Recently, event graphs have been explored to model event datasets and their complex relationships, where events are vertices connected to other vertices representing locations, people's names, dates, and various other event metadata. Graph representation learning methods are promising for extracting latent features from event graphs to enable the use of different classification algorithms. However, existing methods fail to meet essential requirements for event graphs, such as (i) dealing with semi-supervised graph embedding to take advantage of some labeled events, (ii) automatically determining the importance of the relationships between event vertices and their metadata vertices, as well as (iii) dealing with the graph heterogeneity. This paper presents GNEE (GAT Neural Event Embeddings), a method that combines Graph Attention Networks and Graph Regularization. First, an event graph regularization is proposed to ensure that all graph vertices receive event features, thereby mitigating the graph heterogeneity drawback. Second, semi-supervised graph embedding with self-attention mechanism considers existing labeled events, as well as learns the importance of relationships in the event graph during the representation learning process. A statistical analysis of experimental results with five real-world event graphs and six graph embedding methods shows that our GNEE outperforms state-of-the-art semi-supervised graph embedding methods.
    Augmentative eXplanation and the Distributional Gap of Confidence Optimization Score. (arXiv:2201.00009v1 [cs.LG])
    (2 min) This paper introduces the Confidence Optimization (CO) score to directly measure the contribution of heatmaps/saliency maps to the classification performance of a model. Common heatmap generation methods used in the eXplainable Artificial Intelligence (XAI) community are tested through a process we call the Augmentative eXplanation (AX). We find a surprising \textit{gap} in CO scores distribution on these heatmap methods. The gap potentially serves as a novel indicator for the correctness of deep neural network (DNN) prediction. We further introduces Generative AX (GAX) method to generate saliency maps capable of attaining high CO scores. Using GAX, we also qualitatively demonstrate the unintuitiveness of DNN architectures.
    Validation and Transparency in AI systems for pharmacovigilance: a case study applied to the medical literature monitoring of adverse events. (arXiv:2201.00692v1 [cs.IR])
    (2 min) Recent advances in artificial intelligence applied to biomedical text are opening exciting opportunities for improving pharmacovigilance activities currently burdened by the ever growing volumes of real world data. To fully realize these opportunities, existing regulatory guidance and industry best practices should be taken into consideration in order to increase the overall trustworthiness of the system and enable broader adoption. In this paper we present a case study on how to operationalize existing guidance for validated AI systems in pharmacovigilance focusing on the specific task of medical literature monitoring (MLM) of adverse events from the scientific literature. We describe an AI system designed with the goal of reducing effort in MLM activities built in close collaboration with subject matter experts and considering guidance for validated systems in pharmacovigilance and AI transparency. In particular we make use of public disclosures as a useful risk control measure to mitigate system misuse and earn user trust. In addition we present experimental results showing the system can significantly remove screening effort while maintaining high levels of recall (filtering 55% of irrelevant articles on average, for a target recall of 0.99 on suspected adverse articles) and provide a robust method for tuning the desired recall to suit a particular risk profile.
    A Literature Review on Length of Stay Prediction for Stroke Patients using Machine Learning and Statistical Approaches. (arXiv:2201.00005v1 [cs.LG])
    (2 min) Hospital length of stay (LOS) is one of the most essential healthcare metrics that reflects the hospital quality of service and helps improve hospital scheduling and management. LOS prediction helps in cost management because patients who remain in hospitals usually do so in hospital units where resources are severely limited. In this study, we reviewed papers on LOS prediction using machine learning and statistical approaches. Our literature review considers research studies that focus on LOS prediction for stroke patients. Some of the surveyed studies revealed that authors reached contradicting conclusions. For example, the age of the patient was considered an important predictor of LOS for stroke patients in some studies, while other studies concluded that age was not a significant factor. Therefore, additional research is required in this domain to further understand the predictors of LOS for stroke patients.
    Performance Comparison of Deep Learning Architectures for Artifact Removal in Gastrointestinal Endoscopic Imaging. (arXiv:2201.00084v1 [eess.IV])
    (2 min) Endoscopic images typically contain several artifacts. The artifacts significantly impact image analysis result in computer-aided diagnosis. Convolutional neural networks (CNNs), a type of deep learning, can removes such artifacts. Various architectures have been proposed for the CNNs, and the accuracy of artifact removal varies depending on the choice of architecture. Therefore, it is necessary to determine the artifact removal accuracy, depending on the selected architecture. In this study, we focus on endoscopic surgical instruments as artifacts, and determine and discuss the artifact removal accuracy using seven different CNN architectures.
    How do lexical semantics affect translation? An empirical study. (arXiv:2201.00075v1 [cs.CL])
    (2 min) Neural machine translation (NMT) systems aim to map text from one language into another. While there are a wide variety of applications of NMT, one of the most important is translation of natural language. A distinguishing factor of natural language is that words are typically ordered according to the rules of the grammar of a given language. Although many advances have been made in developing NMT systems for translating natural language, little research has been done on understanding how the word ordering of and lexical similarity between the source and target language affect translation performance. Here, we investigate these relationships on a variety of low-resource language pairs from the OpenSubtitles2016 database, where the source language is English, and find that the more similar the target language is to English, the greater the translation performance. In addition, we study the impact of providing NMT models with part of speech of words (POS) in the English sequence and find that, for Transformer-based models, the more dissimilar the target language is from English, the greater the benefit provided by POS.
    Covariate-assisted Sparse Tensor Completion. (arXiv:2103.06428v2 [stat.ML] UPDATED)
    (2 min) We aim to provably complete a sparse and highly-missing tensor in the presence of covariate information along tensor modes. Our motivation comes from online advertising where users click-through-rates (CTR) on ads over various devices form a CTR tensor that has about 96% missing entries and has many zeros on non-missing entries, which makes the standalone tensor completion method unsatisfactory. Beside the CTR tensor, additional ad features or user characteristics are often available. In this paper, we propose Covariate-assisted Sparse Tensor Completion (COSTCO) to incorporate covariate information for the recovery of the sparse tensor. The key idea is to jointly extract latent components from both the tensor and the covariate matrix to learn a synthetic representation. Theoretically, we derive the error bound for the recovered tensor components and explicitly quantify the improvements on both the reveal probability condition and the tensor recovery accuracy due to covariates. Finally, we apply COSTCO to an advertisement dataset consisting of a CTR tensor and ad covariate matrix, leading to 23% accuracy improvement over the baseline. An important by-product is that ad latent components from COSTCO reveal interesting ad clusters, which are useful for better ad targeting.
    A General Descent Aggregation Framework for Gradient-based Bi-level Optimization. (arXiv:2102.07976v3 [cs.LG] UPDATED)
    (0 min) In recent years, a variety of gradient-based methods have been developed to solve Bi-Level Optimization (BLO) problems in machine learning and computer vision areas. However, the theoretical correctness and practical effectiveness of these existing approaches always rely on some restrictive conditions (e.g., Lower-Level Singleton, LLS), which could hardly be satisfied in real-world applications. Moreover, previous literature only proves theoretical results based on their specific iteration strategies, thus lack a general recipe to uniformly analyze the convergence behaviors of different gradient-based BLOs. In this work, we formulate BLOs from an optimistic bi-level viewpoint and establish a new gradient-based algorithmic framework, named Bi-level Descent Aggregation (BDA), to partially address the above issues. Specifically, BDA provides a modularized structure to hierarchically aggregate both the upper- and lower-level subproblems to generate our bi-level iterative dynamics. Theoretically, we establish a general convergence analysis template and derive a new proof recipe to investigate the essential theoretical properties of gradient-based BLO methods. Furthermore, this work systematically explores the convergence behavior of BDA in different optimization scenarios, i.e., considering various solution qualities (i.e., global/local/stationary solution) returned from solving approximation subproblems. Extensive experiments justify our theoretical results and demonstrate the superiority of the proposed algorithm for hyper-parameter optimization and meta-learning tasks. Source code is available at https://github.com/vis-opt-group/BDA.
    Succinct Differentiation of Disparate Boosting Ensemble Learning Methods for Prognostication of Polycystic Ovary Syndrome Diagnosis. (arXiv:2201.00418v1 [cs.LG])
    (0 min) Prognostication of medical problems using the clinical data by leveraging the Machine Learning techniques with stellar precision is one of the most important real world challenges at the present time. Considering the medical problem of Polycystic Ovary Syndrome also known as PCOS is an emerging problem in women aged from 15 to 49. Diagnosing this disorder by using various Boosting Ensemble Methods is something we have presented in this paper. A detailed and compendious differentiation between Adaptive Boost, Gradient Boosting Machine, XGBoost and CatBoost with their respective performance metrics highlighting the hidden anomalies in the data and its effects on the result is something we have presented in this paper. Metrics like Confusion Matrix, Precision, Recall, F1 Score, FPR, RoC Curve and AUC have been used in this paper.
    BARACK: Partially Supervised Group Robustness With Guarantees. (arXiv:2201.00072v1 [cs.LG])
    (2 min) While neural networks have shown remarkable success on classification tasks in terms of average-case performance, they often fail to perform well on certain groups of the data. Such group information may be expensive to obtain; thus, recent works in robustness and fairness have proposed ways to improve worst-group performance even when group labels are unavailable for the training data. However, these methods generally underperform methods that utilize group information at training time. In this work, we assume access to a small number of group labels alongside a larger dataset without group labels. We propose BARACK, a simple two-step framework to utilize this partial group information to improve worst-group performance: train a model to predict the missing group labels for the training data, and then use these predicted group labels in a robust optimization objective. Theoretically, we provide generalization bounds for our approach in terms of the worst-group performance, showing how the generalization error scales with respect to both the total number of training points and the number of training points with group labels. Empirically, our method outperforms the baselines that do not use group information, even when only 1-33% of points have group labels. We provide ablation studies to support the robustness and extensibility of our framework.
    Dynamic Least-Squares Regression. (arXiv:2201.00228v1 [cs.DS])
    (2 min) A common challenge in large-scale supervised learning, is how to exploit new incremental data to a pre-trained model, without re-training the model from scratch. Motivated by this problem, we revisit the canonical problem of dynamic least-squares regression (LSR), where the goal is to learn a linear model over incremental training data. In this setup, data and labels $(\mathbf{A}^{(t)}, \mathbf{b}^{(t)}) \in \mathbb{R}^{t \times d}\times \mathbb{R}^t$ evolve in an online fashion ($t\gg d$), and the goal is to efficiently maintain an (approximate) solution to $\min_{\mathbf{x}^{(t)}} \| \mathbf{A}^{(t)} \mathbf{x}^{(t)} - \mathbf{b}^{(t)} \|_2$ for all $t\in [T]$. Our main result is a dynamic data structure which maintains an arbitrarily small constant approximate solution to dynamic LSR with amortized update time $O(d^{1+o(1)})$, almost matching the running time of the static (sketching-based) solution. By contrast, for exact (or even $1/\mathrm{poly}(n)$-accuracy) solutions, we show a separation between the static and dynamic settings, namely, that dynamic LSR requires $\Omega(d^{2-o(1)})$ amortized update time under the OMv Conjecture (Henzinger et al., STOC'15). Our data structure is conceptually simple, easy to implement, and fast both in theory and practice, as corroborated by experiments over both synthetic and real-world datasets.
    Evaluating Deep Music Generation Methods Using Data Augmentation. (arXiv:2201.00052v1 [cs.SD])
    (2 min) Despite advances in deep algorithmic music generation, evaluation of generated samples often relies on human evaluation, which is subjective and costly. We focus on designing a homogeneous, objective framework for evaluating samples of algorithmically generated music. Any engineered measures to evaluate generated music typically attempt to define the samples' musicality, but do not capture qualities of music such as theme or mood. We do not seek to assess the musical merit of generated music, but instead explore whether generated samples contain meaningful information pertaining to emotion or mood/theme. We achieve this by measuring the change in predictive performance of a music mood/theme classifier after augmenting its training data with generated samples. We analyse music samples generated by three models -- SampleRNN, Jukebox, and DDSP -- and employ a homogeneous framework across all methods to allow for objective comparison. This is the first attempt at augmenting a music genre classification dataset with conditionally generated music. We investigate the classification performance improvement using deep music generation and the ability of the generators to make emotional music by using an additional, emotion annotation of the dataset. Finally, we use a classifier trained on real data to evaluate the label validity of class-conditionally generated samples.
    Multi-view Subspace Adaptive Learning via Autoencoder and Attention. (arXiv:2201.00171v1 [cs.LG])
    (2 min) Multi-view learning can cover all features of data samples more comprehensively, so multi-view learning has attracted widespread attention. Traditional subspace clustering methods, such as sparse subspace clustering (SSC) and low-ranking subspace clustering (LRSC), cluster the affinity matrix for a single view, thus ignoring the problem of fusion between views. In our article, we propose a new Multiview Subspace Adaptive Learning based on Attention and Autoencoder (MSALAA). This method combines a deep autoencoder and a method for aligning the self-representations of various views in Multi-view Low-Rank Sparse Subspace Clustering (MLRSSC), which can not only increase the capability to non-linearity fitting, but also can meets the principles of consistency and complementarity of multi-view learning. We empirically observe significant improvement over existing baseline methods on six real-life datasets.
    Semantic Search for Large Scale Clinical Ontologies. (arXiv:2201.00118v1 [cs.CL])
    (2 min) Finding concepts in large clinical ontologies can be challenging when queries use different vocabularies. A search algorithm that overcomes this problem is useful in applications such as concept normalisation and ontology matching, where concepts can be referred to in different ways, using different synonyms. In this paper, we present a deep learning based approach to build a semantic search system for large clinical ontologies. We propose a Triplet-BERT model and a method that generates training data directly from the ontologies. The model is evaluated using five real benchmark data sets and the results show that our approach achieves high results on both free text to concept and concept to concept searching tasks, and outperforms all baseline methods.
    Towards Robust Graph Neural Networks for Noisy Graphs with Sparse Labels. (arXiv:2201.00232v1 [cs.LG])
    (2 min) Graph Neural Networks (GNNs) have shown their great ability in modeling graph structured data. However, real-world graphs usually contain structure noises and have limited labeled nodes. The performance of GNNs would drop significantly when trained on such graphs, which hinders the adoption of GNNs on many applications. Thus, it is important to develop noise-resistant GNNs with limited labeled nodes. However, the work on this is rather limited. Therefore, we study a novel problem of developing robust GNNs on noisy graphs with limited labeled nodes. Our analysis shows that both the noisy edges and limited labeled nodes could harm the message-passing mechanism of GNNs. To mitigate these issues, we propose a novel framework which adopts the noisy edges as supervision to learn a denoised and dense graph, which can down-weight or eliminate noisy edges and facilitate message passing of GNNs to alleviate the issue of limited labeled nodes. The generated edges are further used to regularize the predictions of unlabeled nodes with label smoothness to better train GNNs. Experimental results on real-world datasets demonstrate the robustness of the proposed framework on noisy graphs with limited labeled nodes.
    Hybrid intelligence for dynamic job-shop scheduling with deep reinforcement learning and attention mechanism. (arXiv:2201.00548v1 [cs.AI])
    (2 min) The dynamic job-shop scheduling problem (DJSP) is a class of scheduling tasks that specifically consider the inherent uncertainties such as changing order requirements and possible machine breakdown in realistic smart manufacturing settings. Since traditional methods cannot dynamically generate effective scheduling strategies in face of the disturbance of environments, we formulate the DJSP as a Markov decision process (MDP) to be tackled by reinforcement learning (RL). For this purpose, we propose a flexible hybrid framework that takes disjunctive graphs as states and a set of general dispatching rules as the action space with minimum prior domain knowledge. The attention mechanism is used as the graph representation learning (GRL) module for the feature extraction of states, and the double dueling deep Q-network with prioritized replay and noisy networks (D3QPN) is employed to map each state to the most appropriate dispatching rule. Furthermore, we present Gymjsp, a public benchmark based on the well-known OR-Library, to provide a standardized off-the-shelf facility for RL and DJSP research communities. Comprehensive experiments on various DJSP instances confirm that our proposed framework is superior to baseline algorithms with smaller makespan across all instances and provide empirical justification for the validity of the various components in the hybrid framework.
    Knowledge intensive state design for traffic signal control. (arXiv:2201.00006v1 [cs.LG])
    (2 min) There is a general trend of applying reinforcement learning (RL) techniques for traffic signal control (TSC). Recently, most studies pay attention to the neural network design and rarely concentrate on the state representation. Does the design of state representation has a good impact on TSC? In this paper, we (1) propose an effective state representation as queue length of vehicles with intensive knowledge; (2) present a TSC method called MaxQueue based on our state representation approach; (3) develop a general RL-based TSC template called QL-XLight with queue length as state and reward and generate QL-FRAP, QL-CoLight, and QL-DQN by our QL-XLight template based on traditional and latest RL models.Through comprehensive experiments on multiple real-world datasets, we demonstrate that: (1) our MaxQueue method outperforms the latest RL based methods; (2) QL-FRAP and QL-CoLight achieves a new state-of-the-art (SOTA). In general, state representation with intensive knowledge is also essential for TSC methods. Our code is released on Github.
    ECOD: Unsupervised Outlier Detection Using Empirical Cumulative Distribution Functions. (arXiv:2201.00382v1 [cs.LG])
    (2 min) Outlier detection refers to the identification of data points that deviate from a general data distribution. Existing unsupervised approaches often suffer from high computational cost, complex hyperparameter tuning, and limited interpretability, especially when working with large, high-dimensional datasets. To address these issues, we present a simple yet effective algorithm called ECOD (Empirical-Cumulative-distribution-based Outlier Detection), which is inspired by the fact that outliers are often the "rare events" that appear in the tails of a distribution. In a nutshell, ECOD first estimates the underlying distribution of the input data in a nonparametric fashion by computing the empirical cumulative distribution per dimension of the data. ECOD then uses these empirical distributions to estimate tail probabilities per dimension for each data point. Finally, ECOD computes an outlier score of each data point by aggregating estimated tail probabilities across dimensions. Our contributions are as follows: (1) we propose a novel outlier detection method called ECOD, which is both parameter-free and easy to interpret; (2) we perform extensive experiments on 30 benchmark datasets, where we find that ECOD outperforms 11 state-of-the-art baselines in terms of accuracy, efficiency, and scalability; and (3) we release an easy-to-use and scalable (with distributed support) Python implementation for accessibility and reproducibility.
    Thinking inside the box: A tutorial on grey-box Bayesian optimization. (arXiv:2201.00272v1 [cs.LG])
    (2 min) Bayesian optimization (BO) is a framework for global optimization of expensive-to-evaluate objective functions. Classical BO methods assume that the objective function is a black box. However, internal information about objective function computation is often available. For example, when optimizing a manufacturing line's throughput with simulation, we observe the number of parts waiting at each workstation, in addition to the overall throughput. Recent BO methods leverage such internal information to dramatically improve performance. We call these "grey-box" BO methods because they treat objective computation as partially observable and even modifiable, blending the black-box approach with so-called "white-box" first-principles knowledge of objective function computation. This tutorial describes these methods, focusing on BO of composite objective functions, where one can observe and selectively evaluate individual constituents that feed into the overall objective; and multi-fidelity BO, where one can evaluate cheaper approximations of the objective function by varying parameters of the evaluation oracle.
    Transfer-learning-based Surrogate Model for Thermal Conductivity of Nanofluids. (arXiv:2201.00435v1 [physics.flu-dyn])
    (2 min) Heat transfer characteristics of nanofluids have been extensively studied since the 1990s. Research investigations show that the suspended nanoparticles significantly alter the suspension's thermal properties. The thermal conductivity of nanofluids is one of the properties that is generally found to be greater than that of the base fluid. This increase in thermal conductivity is found to depend on several parameters. Several theories have been proposed to model the thermal conductivities of nanofluids, but there is no reliable universal theory yet to model the anomalous thermal conductivity of nanofluids. In recent years, supervised data-driven methods have been successfully employed to create surrogate models across various scientific disciplines, especially for modeling difficult-to-understand phenomena. These supervised learning methods allow the models to capture highly non-linear phenomena. In this work, we have taken advantage of existing correlations and used them concurrently with available experimental results to develop more robust surrogate models for predicting the thermal conductivity of nanofluids. Artificial neural networks are trained using the transfer learning approach to predict the thermal conductivity enhancement of nanofluids with spherical particles for 32 different particle-fluid combinations (8 particles materials and 4 fluids). The large amount of lower accuracy data generated from correlations is used to coarse-tune the model parameters, and the limited amount of more trustworthy experimental data is used to fine-tune the model parameters. The transfer learning-based models' results are compared with those from baseline models which are trained only on experimental data using a goodness of fit metric. It is found that the transfer learning models perform better with goodness of fit values of 0.93 as opposed to 0.83 from the baseline models.
    TransLog: A Unified Transformer-based Framework for Log Anomaly Detection. (arXiv:2201.00016v1 [cs.LG])
    (2 min) Log anomaly detection is a key component in the field of artificial intelligence for IT operations (AIOps). Considering log data of variant domains, retraining the whole network for unknown domains is inefficient in real industrial scenarios especially for low-resource domains. However, previous deep models merely focused on extracting the semantics of log sequence in the same domain, leading to poor generalization on multi-domain logs. Therefore, we propose a unified Transformer-based framework for log anomaly detection (\ourmethod{}), which is comprised of the pretraining and adapter-based tuning stage. Our model is first pretrained on the source domain to obtain shared semantic knowledge of log data. Then, we transfer the pretrained model to the target domain via the adapter-based tuning. The proposed method is evaluated on three public datasets including one source domain and two target domains. The experimental results demonstrate that our simple yet efficient approach, with fewer trainable parameters and lower training costs in the target domain, achieves state-of-the-art performance on three benchmarks.
    Overview of the EEG Pilot Subtask at MediaEval 2021: Predicting Media Memorability. (arXiv:2201.00620v1 [q-bio.NC])
    (2 min) The aim of the Memorability-EEG pilot subtask at MediaEval'2021 is to promote interest in the use of neural signals -- either alone or in combination with other data sources -- in the context of predicting video memorability by highlighting the utility of EEG data. The dataset created consists of pre-extracted features from EEG recordings of subjects while watching a subset of videos from Predicting Media Memorability subtask 1. This demonstration pilot gives interested researchers a sense of how neural signals can be used without any prior domain knowledge, and enables them to do so in a future memorability task. The dataset can be used to support the exploration of novel machine learning and processing strategies for predicting video memorability, while potentially increasing interdisciplinary interest in the subject of memorability, and opening the door to new combined EEG-computer vision approaches.
    An analysis of over-sampling labeled data in semi-supervised learning with FixMatch. (arXiv:2201.00604v1 [cs.LG])
    (2 min) Most semi-supervised learning methods over-sample labeled data when constructing training mini-batches. This paper studies whether this common practice improves learning and how. We compare it to an alternative setting where each mini-batch is uniformly sampled from all the training data, labeled or not, which greatly reduces direct supervision from true labels in typical low-label regimes. However, this simpler setting can also be seen as more general and even necessary in multi-task problems where over-sampling labeled data would become intractable. Our experiments on semi-supervised CIFAR-10 image classification using FixMatch show a performance drop when using the uniform sampling approach which diminishes when the amount of labeled data or the training time increases. Further, we analyse the training dynamics to understand how over-sampling of labeled data compares to uniform sampling. Our main finding is that over-sampling is especially beneficial early in training but gets less important in the later stages when more pseudo-labels become correct. Nevertheless, we also find that keeping some true labels remains important to avoid the accumulation of confirmation errors from incorrect pseudo-labels.
    The GatedTabTransformer. An enhanced deep learning architecture for tabular modeling. (arXiv:2201.00199v1 [cs.LG])
    (2 min) There is an increasing interest in the application of deep learning architectures to tabular data. One of the state-of-the-art solutions is TabTransformer which incorporates an attention mechanism to better track relationships between categorical features and then makes use of a standard MLP to output its final logits. In this paper we propose multiple modifications to the original TabTransformer performing better on binary classification tasks for three separate datasets with more than 1% AUROC gains. Inspired by gated MLP, linear projections are implemented in the MLP block and multiple activation functions are tested. We also evaluate the importance of specific hyper parameters during training.
    Optimal Representations for Covariate Shift. (arXiv:2201.00057v1 [cs.LG])
    (2 min) Machine learning systems often experience a distribution shift between training and testing. In this paper, we introduce a simple variational objective whose optima are exactly the set of all representations on which risk minimizers are guaranteed to be robust to any distribution shift that preserves the Bayes predictor, e.g., covariate shifts. Our objective has two components. First, a representation must remain discriminative for the task, i.e., some predictor must be able to simultaneously minimize the source and target risk. Second, the representation's marginal support needs to be the same across source and target. We make this practical by designing self-supervised learning methods that only use unlabelled data and augmentations to train robust representations. Our objectives achieve state-of-the-art results on DomainBed, and give insights into the robustness of recent methods, such as CLIP.
    Revisiting PGD Attacks for Stability Analysis of Large-Scale Nonlinear Systems and Perception-Based Control. (arXiv:2201.00801v1 [math.OC])
    (2 min) Many existing region-of-attraction (ROA) analysis tools find difficulty in addressing feedback systems with large-scale neural network (NN) policies and/or high-dimensional sensing modalities such as cameras. In this paper, we tailor the projected gradient descent (PGD) attack method developed in the adversarial learning community as a general-purpose ROA analysis tool for large-scale nonlinear systems and end-to-end perception-based control. We show that the ROA analysis can be approximated as a constrained maximization problem whose goal is to find the worst-case initial condition which shifts the terminal state the most. Then we present two PGD-based iterative methods which can be used to solve the resultant constrained maximization problem. Our analysis is not based on Lyapunov theory, and hence requires minimum information of the problem structures. In the model-based setting, we show that the PGD updates can be efficiently performed using back-propagation. In the model-free setting (which is more relevant to ROA analysis of perception-based control), we propose a finite-difference PGD estimate which is general and only requires a black-box simulator for generating the trajectories of the closed-loop system given any initial state. We demonstrate the scalability and generality of our analysis tool on several numerical examples with large-scale NN policies and high-dimensional image observations. We believe that our proposed analysis serves as a meaningful initial step toward further understanding of closed-loop stability of large-scale nonlinear systems and perception-based control.
    Stochastic convex optimization for provably efficient apprenticeship learning. (arXiv:2201.00039v1 [cs.LG])
    (2 min) We consider large-scale Markov decision processes (MDPs) with an unknown cost function and employ stochastic convex optimization tools to address the problem of imitation learning, which consists of learning a policy from a finite set of expert demonstrations. We adopt the apprenticeship learning formalism, which carries the assumption that the true cost function can be represented as a linear combination of some known features. Existing inverse reinforcement learning algorithms come with strong theoretical guarantees, but are computationally expensive because they use reinforcement learning or planning algorithms as a subroutine. On the other hand, state-of-the-art policy gradient based algorithms (like IM-REINFORCE, IM-TRPO, and GAIL), achieve significant empirical success in challenging benchmark tasks, but are not well understood in terms of theory. With an emphasis on non-asymptotic guarantees of performance, we propose a method that directly learns a policy from expert demonstrations, bypassing the intermediate step of learning the cost function, by formulating the problem as a single convex optimization problem over occupancy measures. We develop a computationally efficient algorithm and derive high confidence regret bounds on the quality of the extracted policy, utilizing results from stochastic convex optimization and recent works in approximate linear programming for solving forward MDPs.
    Deep Nonparametric Estimation of Operators between Infinite Dimensional Spaces. (arXiv:2201.00217v1 [stat.ML])
    (2 min) Learning operators between infinitely dimensional spaces is an important learning task arising in wide applications in machine learning, imaging science, mathematical modeling and simulations, etc. This paper studies the nonparametric estimation of Lipschitz operators using deep neural networks. Non-asymptotic upper bounds are derived for the generalization error of the empirical risk minimizer over a properly chosen network class. Under the assumption that the target operator exhibits a low dimensional structure, our error bounds decay as the training sample size increases, with an attractive fast rate depending on the intrinsic dimension in our estimation. Our assumptions cover most scenarios in real applications and our results give rise to fast rates by exploiting low dimensional structures of data in operator estimation. We also investigate the influence of network structures (e.g., network width, depth, and sparsity) on the generalization error of the neural network estimator and propose a general suggestion on the choice of network structures to maximize the learning efficiency quantitatively.
    Experiment Based Crafting and Analyzing of Machine Learning Solutions. (arXiv:2201.00355v1 [cs.LG])
    (2 min) The crafting of machine learning (ML) based systems requires statistical control throughout its life cycle. Careful quantification of business requirements and identification of key factors that impact the business requirements reduces the risk of a project failure. The quantification of business requirements results in the definition of random variables representing the system key performance indicators that need to be analyzed through statistical experiments. In addition, available data for training and experiment results impact the design of the system. Once the system is developed, it is tested and continually monitored to ensure it meets its business requirements. This is done through the continued application of statistical experiments to analyze and control the key performance indicators. This book teaches the art of crafting and developing ML based systems. It advocates an "experiment first" approach stressing the need to define statistical experiments from the beginning of the project life cycle. It also discusses in detail how to apply statistical control on the ML based system throughout its lifecycle.
    Croesus: Multi-Stage Processing and Transactions for Video-Analytics in Edge-Cloud Systems. (arXiv:2201.00063v1 [eess.SY])
    (2 min) Emerging edge applications require both a fast response latency and complex processing. This is infeasible without expensive hardware that can process complex operations -- such as object detection -- within a short time. Many approach this problem by addressing the complexity of the models -- via model compression, pruning and quantization -- or compressing the input. In this paper, we propose a different perspective when addressing the performance challenges. Croesus is a multi-stage approach to edge-cloud systems that provides the ability to find the balance between accuracy and performance. Croesus consists of two stages (that can be generalized to multiple stages): an initial and a final stage. The initial stage performs the computation in real-time using approximate/best-effort computation at the edge. The final stage performs the full computation at the cloud, and uses the results to correct any errors made at the initial stage. In this paper, we demonstrate the implications of such an approach on a video analytics use-case and show how multi-stage processing yields a better balance between accuracy and performance. Moreover, we study the safety of multi-stage transactions via two proposals: multi-stage serializability (MS-SR) and multi-stage invariant confluence with Apologies (MS-IA).
    Actor-Critic Network for Q&A in an Adversarial Environment. (arXiv:2201.00455v1 [cs.CL])
    (2 min) Significant work has been placed in the Q&A NLP space to build models that are more robust to adversarial attacks. Two key areas of focus are in generating adversarial data for the purposes of training against these situations or modifying existing architectures to build robustness within. This paper introduces an approach that joins these two ideas together to train a critic model for use in an almost reinforcement learning framework. Using the Adversarial SQuAD "Add One Sent" dataset we show that there are some promising signs for this method in protecting against Adversarial attacks.
    DiffuseVAE: Efficient, Controllable and High-Fidelity Generation from Low-Dimensional Latents. (arXiv:2201.00308v1 [cs.LG])
    (2 min) Diffusion Probabilistic models have been shown to generate state-of-the-art results on several competitive image synthesis benchmarks but lack a low-dimensional, interpretable latent space, and are slow at generation. On the other hand, Variational Autoencoders (VAEs) typically have access to a low-dimensional latent space but exhibit poor sample quality. Despite recent advances, VAEs usually require high-dimensional hierarchies of the latent codes to generate high-quality samples. We present DiffuseVAE, a novel generative framework that integrates VAE within a diffusion model framework, and leverage this to design a novel conditional parameterization for diffusion models. We show that the resulting model can improve upon the unconditional diffusion model in terms of sampling efficiency while also equipping diffusion models with the low-dimensional VAE inferred latent code. Furthermore, we show that the proposed model can generate high-resolution samples and exhibits synthesis quality comparable to state-of-the-art models on standard benchmarks. Lastly, we show that the proposed method can be used for controllable image synthesis and also exhibits out-of-the-box capabilities for downstream tasks like image super-resolution and denoising. For reproducibility, our source code is publicly available at \url{https://github.com/kpandey008/DiffuseVAE}.
  • stat.ML updates on arXiv.org

    A Near-Optimal Algorithm for Debiasing Trained Machine Learning Models. (arXiv:2106.12887v2 [cs.LG] UPDATED)
    (2 min) We present a scalable post-processing algorithm for debiasing trained models, including deep neural networks (DNNs), which we prove to be near-optimal by bounding its excess Bayes risk. We empirically validate its advantages on standard benchmark datasets across both classical algorithms as well as modern DNN architectures and demonstrate that it outperforms previous post-processing methods while performing on par with in-processing. In addition, we show that the proposed algorithm is particularly effective for models trained at scale where post-processing is a natural and practical choice.
    Scalable semi-supervised dimensionality reduction with GPU-accelerated EmbedSOM. (arXiv:2201.00701v1 [cs.LG])
    (2 min) Dimensionality reduction methods have found vast application as visualization tools in diverse areas of science. Although many different methods exist, their performance is often insufficient for providing quick insight into many contemporary datasets, and the unsupervised mode of use prevents the users from utilizing the methods for dataset exploration and fine-tuning the details for improved visualization quality. We present BlosSOM, a high-performance semi-supervised dimensionality reduction software for interactive user-steerable visualization of high-dimensional datasets with millions of individual data points. BlosSOM builds on a GPU-accelerated implementation of the EmbedSOM algorithm, complemented by several landmark-based algorithms for interfacing the unsupervised model learning algorithms with the user supervision. We show the application of BlosSOM on realistic datasets, where it helps to produce high-quality visualizations that incorporate user-specified layout and focus on certain features. We believe the semi-supervised dimensionality reduction will improve the data visualization possibilities for science areas such as single-cell cytometry, and provide a fast and efficient base methodology for new directions in dataset exploration and annotation.
    Optimal Sampling Gaps for Adaptive Submodular Maximization. (arXiv:2104.01750v4 [cs.LG] UPDATED)
    (2 min) Running machine learning algorithms on large and rapidly growing volumes of data is often computationally expensive, one common trick to reduce the size of a data set, and thus reduce the computational cost of machine learning algorithms, is \emph{probability sampling}. It creates a sampled data set by including each data point from the original data set with a known probability. Although the benefit of running machine learning algorithms on the reduced data set is obvious, one major concern is that the performance of the solution obtained from samples might be much worse than that of the optimal solution when using the full data set. In this paper, we examine the performance loss caused by probability sampling in the context of adaptive submodular maximization. We consider a simple probability sampling method which selects each data point with probability $r\in[0,1]$. If we set the sampling rate $r=1$, our problem reduces to finding a solution based on the original full data set. We define sampling gap as the largest ratio between the optimal solution obtained from the full data set and the optimal solution obtained from the samples, over independence systems. %It captures the performance loss of the optimal solution caused by the probability sampling. Our main contribution is to show that if the utility function is policywise submodular, then for a given sampling rate $r$, the sampling gap is both upper bounded and lower bounded by $1/r$. One immediate implication of our result is that if we can find an $\alpha$-approximation solution based on a sampled data set (which is sampled at sampling rate $r$), then this solution achieves an $\alpha r$ approximation ratio against the optimal solution when using the full data set.
    On the Provable Generalization of Recurrent Neural Networks. (arXiv:2109.14142v3 [cs.LG] UPDATED)
    (2 min) Recurrent Neural Network (RNN) is a fundamental structure in deep learning. Recently, some works study the training process of over-parameterized neural networks, and show that over-parameterized networks can learn functions in some notable concept classes with a provable generalization error bound. In this paper, we analyze the training and generalization for RNNs with random initialization, and provide the following improvements over recent works: 1) For a RNN with input sequence $x=(X_1,X_2,...,X_L)$, previous works study to learn functions that are summation of $f(\beta^T_lX_l)$ and require normalized conditions that $||X_l||\leq\epsilon$ with some very small $\epsilon$ depending on the complexity of $f$. In this paper, using detailed analysis about the neural tangent kernel matrix, we prove a generalization error bound to learn such functions without normalized conditions and show that some notable concept classes are learnable with the numbers of iterations and samples scaling almost-polynomially in the input length $L$. 2) Moreover, we prove a novel result to learn N-variables functions of input sequence with the form $f(\beta^T[X_{l_1},...,X_{l_N}])$, which do not belong to the "additive" concept class, i,e., the summation of function $f(X_l)$. And we show that when either $N$ or $l_0=\max(l_1,..,l_N)-\min(l_1,..,l_N)$ is small, $f(\beta^T[X_{l_1},...,X_{l_N}])$ will be learnable with the number iterations and samples scaling almost-polynomially in the input length $L$.
    Online Regularization towards Always-Valid High-Dimensional Dynamic Pricing. (arXiv:2007.02470v2 [stat.ML] UPDATED)
    (2 min) Devising dynamic pricing policy with always valid online statistical learning procedure is an important and as yet unresolved problem. Most existing dynamic pricing policy, which focus on the faithfulness of adopted customer choice models, exhibit a limited capability for adapting the online uncertainty of learned statistical model during pricing process. In this paper, we propose a novel approach for designing dynamic pricing policy based regularized online statistical learning with theoretical guarantees. The new approach overcomes the challenge of continuous monitoring of online Lasso procedure and possesses several appealing properties. In particular, we make the decisive observation that the always-validity of pricing decisions builds and thrives on the online regularization scheme. Our proposed online regularization scheme equips the proposed optimistic online regularized maximum likelihood pricing (OORMLP) pricing policy with three major advantages: encode market noise knowledge into pricing process optimism; empower online statistical learning with always-validity over all decision points; envelop prediction error process with time-uniform non-asymptotic oracle inequalities. This type of non-asymptotic inference results allows us to design more sample-efficient and robust dynamic pricing algorithms in practice. In theory, the proposed OORMLP algorithm exploits the sparsity structure of high-dimensional models and secures a logarithmic regret in a decision horizon. These theoretical advances are made possible by proposing an optimistic online Lasso procedure that resolves dynamic pricing problems at the process level, based on a novel use of non-asymptotic martingale concentration. In experiments, we evaluate OORMLP in different synthetic and real pricing problem settings, and demonstrate that OORMLP advances the state-of-the-art methods.
    Conditional Selective Inference for Robust Regression and Outlier Detection using Piecewise-Linear Homotopy Continuation. (arXiv:2104.10840v2 [stat.ML] UPDATED)
    (2 min) In practical data analysis under noisy environment, it is common to first use robust methods to identify outliers, and then to conduct further analysis after removing the outliers. In this paper, we consider statistical inference of the model estimated after outliers are removed, which can be interpreted as a selective inference (SI) problem. To use conditional SI framework, it is necessary to characterize the events of how the robust method identifies outliers. Unfortunately, the existing methods cannot be directly used here because they are applicable to the case where the selection events can be represented by linear/quadratic constraints. In this paper, we propose a conditional SI method for popular robust regressions by using homotopy method. We show that the proposed conditional SI method is applicable to a wide class of robust regression and outlier detection methods and has good empirical performance on both synthetic data and real data experiments.
    Application of Machine Learning Methods in Inferring Surface Water Groundwater Exchanges using High Temporal Resolution Temperature Measurements. (arXiv:2201.00726v1 [cs.LG])
    (2 min) We examine the ability of machine learning (ML) and deep learning (DL) algorithms to infer surface/ground exchange flux based on subsurface temperature observations. The observations and fluxes are produced from a high-resolution numerical model representing conditions in the Columbia River near the Department of Energy Hanford site located in southeastern Washington State. Random measurement error, of varying magnitude, is added to the synthetic temperature observations. The results indicate that both ML and DL methods can be used to infer the surface/ground exchange flux. DL methods, especially convolutional neural networks, outperform the ML methods when used to interpret noisy temperature data with a smoothing filter applied. However, the ML methods also performed well and they are can better identify a reduced number of important observations, which could be useful for measurement network optimization. Surprisingly, the ML and DL methods better inferred upward flux than downward flux. This is in direct contrast to previous findings using numerical models to infer flux from temperature observations and it may suggest that combined use of ML or DL inference with numerical inference could improve flux estimation beneath river systems.
    Provable Meta-Learning of Linear Representations. (arXiv:2002.11684v5 [cs.LG] UPDATED)
    (2 min) Meta-learning, or learning-to-learn, seeks to design algorithms that can utilize previous experience to rapidly learn new skills or adapt to new environments. Representation learning -- a key tool for performing meta-learning -- learns a data representation that can transfer knowledge across multiple tasks, which is essential in regimes where data is scarce. Despite a recent surge of interest in the practice of meta-learning, the theoretical underpinnings of meta-learning algorithms are lacking, especially in the context of learning transferable representations. In this paper, we focus on the problem of multi-task linear regression -- in which multiple linear regression models share a common, low-dimensional linear representation. Here, we provide provably fast, sample-efficient algorithms to address the dual challenges of (1) learning a common set of features from multiple, related tasks, and (2) transferring this knowledge to new, unseen tasks. Both are central to the general problem of meta-learning. Finally, we complement these results by providing information-theoretic lower bounds on the sample complexity of learning these linear features.
    Statistical Inference with Local Optima. (arXiv:1807.04431v2 [math.ST] UPDATED)
    (2 min) We study the statistical properties of an estimator derived by applying a gradient ascent method with multiple initializations to a multi-modal likelihood function. We derive the population quantity that is the target of this estimator and study the properties of confidence intervals (CIs) constructed from asymptotic normality and the bootstrap approach. In particular, we analyze the coverage deficiency due to finite number of random initializations. We also investigate the CIs by inverting the likelihood ratio test, the score test, and the Wald test, and we show that the resulting CIs may be very different. We propose a two-sample test procedure even when the MLE is intractable. In addition, we analyze the performance of the EM algorithm under random initializations and derive the coverage of a CI with a finite number of initializations.
    Kernel Two-Sample Tests in High Dimension: Interplay Between Moment Discrepancy and Dimension-and-Sample Orders. (arXiv:2201.00073v1 [math.ST])
    (2 min) Motivated by the increasing use of kernel-based metrics for high-dimensional and large-scale data, we study the asymptotic behavior of kernel two-sample tests when the dimension and sample sizes both diverge to infinity. We focus on the maximum mean discrepancy (MMD) with the kernel of the form $k(x,y)=f(\|x-y\|_{2}^{2}/\gamma)$, including MMD with the Gaussian kernel and the Laplacian kernel, and the energy distance as special cases. We derive asymptotic expansions of the kernel two-sample statistics, based on which we establish the central limit theorem (CLT) under both the null hypothesis and the local and fixed alternatives. The new non-null CLT results allow us to perform asymptotic exact power analysis, which reveals a delicate interplay between the moment discrepancy that can be detected by the kernel two-sample tests and the dimension-and-sample orders. The asymptotic theory is further corroborated through numerical studies. Our findings complement those in the recent literature and shed new light on the use of kernel two-sample tests for high-dimensional and large-scale data.
    On Robust Probabilistic Principal Component Analysis using Multivariate $t$-Distributions. (arXiv:2010.10786v2 [stat.ME] UPDATED)
    (2 min) Probabilistic principal component analysis (PPCA) is a probabilistic reformulation of principal component analysis (PCA), under the framework of a Gaussian latent variable model. To improve the robustness of PPCA, it has been proposed to change the underlying Gaussian distributions to multivariate $t$-distributions. Based on the representation of $t$-distribution as a scale mixture of Gaussian distributions, a hierarchical model is used for implementation. However, in the existing literature, the hierarchical model implemented does not yield the equivalent interpretation. In this paper, we present two sets of equivalent relationships between the high-level multivariate $t$-PPCA framework and the hierarchical model used for implementation. In doing so, we clarify a current misrepresentation in the literature, by specifying the correct correspondence. In addition, we discuss the performance of different multivariate $t$ robust PPCA methods both in theory and simulation studies, and propose a novel Monte Carlo expectation-maximization (MCEM) algorithm to implement one general type of such models.
    Deep Nonparametric Estimation of Operators between Infinite Dimensional Spaces. (arXiv:2201.00217v1 [stat.ML])
    (2 min) Learning operators between infinitely dimensional spaces is an important learning task arising in wide applications in machine learning, imaging science, mathematical modeling and simulations, etc. This paper studies the nonparametric estimation of Lipschitz operators using deep neural networks. Non-asymptotic upper bounds are derived for the generalization error of the empirical risk minimizer over a properly chosen network class. Under the assumption that the target operator exhibits a low dimensional structure, our error bounds decay as the training sample size increases, with an attractive fast rate depending on the intrinsic dimension in our estimation. Our assumptions cover most scenarios in real applications and our results give rise to fast rates by exploiting low dimensional structures of data in operator estimation. We also investigate the influence of network structures (e.g., network width, depth, and sparsity) on the generalization error of the neural network estimator and propose a general suggestion on the choice of network structures to maximize the learning efficiency quantitatively.
    Global convergence of optimized adaptive importance samplers. (arXiv:2201.00409v1 [stat.CO])
    (2 min) We analyze the optimized adaptive importance sampler (OAIS) for performing Monte Carlo integration with general proposals. We leverage a classical result which shows that the bias and the mean-squared error (MSE) of the importance sampling scales with the $\chi^2$-divergence between the target and the proposal and develop a scheme which performs global optimization of $\chi^2$-divergence. While it is known that this quantity is convex for exponential family proposals, the case of the general proposals has been an open problem. We close this gap by utilizing stochastic gradient Langevin dynamics (SGLD) and its underdamped counterpart for the global optimization of $\chi^2$-divergence and derive nonasymptotic bounds for the MSE by leveraging recent results from non-convex optimization literature. The resulting AIS schemes have explicit theoretical guarantees uniform in the number of iterations.
    ECOD: Unsupervised Outlier Detection Using Empirical Cumulative Distribution Functions. (arXiv:2201.00382v1 [cs.LG])
    (2 min) Outlier detection refers to the identification of data points that deviate from a general data distribution. Existing unsupervised approaches often suffer from high computational cost, complex hyperparameter tuning, and limited interpretability, especially when working with large, high-dimensional datasets. To address these issues, we present a simple yet effective algorithm called ECOD (Empirical-Cumulative-distribution-based Outlier Detection), which is inspired by the fact that outliers are often the "rare events" that appear in the tails of a distribution. In a nutshell, ECOD first estimates the underlying distribution of the input data in a nonparametric fashion by computing the empirical cumulative distribution per dimension of the data. ECOD then uses these empirical distributions to estimate tail probabilities per dimension for each data point. Finally, ECOD computes an outlier score of each data point by aggregating estimated tail probabilities across dimensions. Our contributions are as follows: (1) we propose a novel outlier detection method called ECOD, which is both parameter-free and easy to interpret; (2) we perform extensive experiments on 30 benchmark datasets, where we find that ECOD outperforms 11 state-of-the-art baselines in terms of accuracy, efficiency, and scalability; and (3) we release an easy-to-use and scalable (with distributed support) Python implementation for accessibility and reproducibility.
    Cluster Stability Selection. (arXiv:2201.00494v1 [stat.ME])
    (2 min) Stability selection (Meinshausen and Buhlmann, 2010) makes any feature selection method more stable by returning only those features that are consistently selected across many subsamples. We prove (in what is, to our knowledge, the first result of its kind) that for data containing highly correlated proxies for an important latent variable, the lasso typically selects one proxy, yet stability selection with the lasso can fail to select any proxy, leading to worse predictive performance than the lasso alone. We introduce cluster stability selection, which exploits the practitioner's knowledge that highly correlated clusters exist in the data, resulting in better feature rankings than stability selection in this setting. We consider several feature-combination approaches, including taking a weighted average of the features in each important cluster where weights are determined by the frequency with which cluster members are selected, which we show leads to better predictive models than previous proposals. We present generalizations of theoretical guarantees from Meinshausen and Buhlmann (2010) and Shah and Samworth (2012) to show that cluster stability selection retains the same guarantees. In summary, cluster stability selection enjoys the best of both worlds, yielding a sparse selected set that is both stable and has good predictive performance.
    Risk Bounds for Over-parameterized Maximum Margin Classification on Sub-Gaussian Mixtures. (arXiv:2104.13628v2 [cs.LG] UPDATED)
    (2 min) Modern machine learning systems such as deep neural networks are often highly over-parameterized so that they can fit the noisy training data exactly, yet they can still achieve small test errors in practice. In this paper, we study this "benign overfitting" phenomenon of the maximum margin classifier for linear classification problems. Specifically, we consider data generated from sub-Gaussian mixtures, and provide a tight risk bound for the maximum margin linear classifier in the over-parameterized setting. Our results precisely characterize the condition under which benign overfitting can occur in linear classification problems, and improve on previous work. They also have direct implications for over-parameterized logistic regression.
    An Ensemble of Pre-trained Transformer Models For Imbalanced Multiclass Malware Classification. (arXiv:2112.13236v2 [cs.CR] UPDATED)
    (2 min) Classification of malware families is crucial for a comprehensive understanding of how they can infect devices, computers, or systems. Thus, malware identification enables security researchers and incident responders to take precautions against malware and accelerate mitigation. API call sequences made by malware are widely utilized features by machine and deep learning models for malware classification as these sequences represent the behavior of malware. However, traditional machine and deep learning models remain incapable of capturing sequence relationships between API calls. On the other hand, the transformer-based models process sequences as a whole and learn relationships between API calls due to multi-head attention mechanisms and positional embeddings. Our experiments demonstrate that the transformer model with one transformer block layer surpassed the widely used base architecture, LSTM. Moreover, BERT or CANINE, pre-trained transformer models, outperformed in classifying highly imbalanced malware families according to evaluation metrics, F1-score, and AUC score. Furthermore, the proposed bagging-based random transformer forest (RTF), an ensemble of BERT or CANINE, has reached the state-of-the-art evaluation scores on three out of four datasets, particularly state-of-the-art F1-score of 0.6149 on one of the commonly used benchmark dataset.
    Covariate-assisted Sparse Tensor Completion. (arXiv:2103.06428v2 [stat.ML] UPDATED)
    (2 min) We aim to provably complete a sparse and highly-missing tensor in the presence of covariate information along tensor modes. Our motivation comes from online advertising where users click-through-rates (CTR) on ads over various devices form a CTR tensor that has about 96% missing entries and has many zeros on non-missing entries, which makes the standalone tensor completion method unsatisfactory. Beside the CTR tensor, additional ad features or user characteristics are often available. In this paper, we propose Covariate-assisted Sparse Tensor Completion (COSTCO) to incorporate covariate information for the recovery of the sparse tensor. The key idea is to jointly extract latent components from both the tensor and the covariate matrix to learn a synthetic representation. Theoretically, we derive the error bound for the recovered tensor components and explicitly quantify the improvements on both the reveal probability condition and the tensor recovery accuracy due to covariates. Finally, we apply COSTCO to an advertisement dataset consisting of a CTR tensor and ad covariate matrix, leading to 23% accuracy improvement over the baseline. An important by-product is that ad latent components from COSTCO reveal interesting ad clusters, which are useful for better ad targeting.
    DSelect-k: Differentiable Selection in the Mixture of Experts with Applications to Multi-Task Learning. (arXiv:2106.03760v3 [cs.LG] UPDATED)
    (2 min) The Mixture-of-Experts (MoE) architecture is showing promising results in improving parameter sharing in multi-task learning (MTL) and in scaling high-capacity neural networks. State-of-the-art MoE models use a trainable sparse gate to select a subset of the experts for each input example. While conceptually appealing, existing sparse gates, such as Top-k, are not smooth. The lack of smoothness can lead to convergence and statistical performance issues when training with gradient-based methods. In this paper, we develop DSelect-k: a continuously differentiable and sparse gate for MoE, based on a novel binary encoding formulation. The gate can be trained using first-order methods, such as stochastic gradient descent, and offers explicit control over the number of experts to select. We demonstrate the effectiveness of DSelect-k on both synthetic and real MTL datasets with up to $128$ tasks. Our experiments indicate that DSelect-k can achieve statistically significant improvements in prediction and expert selection over popular MoE gates. Notably, on a real-world, large-scale recommender system, DSelect-k achieves over $22\%$ improvement in predictive performance compared to Top-k. We provide an open-source implementation of DSelect-k.
    Class-Incremental Continual Learning into the eXtended DER-verse. (arXiv:2201.00766v1 [cs.LG])
    (2 min) The staple of human intelligence is the capability of acquiring knowledge in a continuous fashion. In stark contrast, Deep Networks forget catastrophically and, for this reason, the sub-field of Class-Incremental Continual Learning fosters methods that learn a sequence of tasks incrementally, blending sequentially-gained knowledge into a comprehensive prediction. This work aims at assessing and overcoming the pitfalls of our previous proposal Dark Experience Replay (DER), a simple and effective approach that combines rehearsal and Knowledge Distillation. Inspired by the way our minds constantly rewrite past recollections and set expectations for the future, we endow our model with the abilities to i) revise its replay memory to welcome novel information regarding past data ii) pave the way for learning yet unseen classes. We show that the application of these strategies leads to remarkable improvements; indeed, the resulting method - termed eXtended-DER (X-DER) - outperforms the state of the art on both standard benchmarks (such as CIFAR-100 and miniImagenet) and a novel one here introduced. To gain a better understanding, we further provide extensive ablation studies that corroborate and extend the findings of our previous research (e.g. the value of Knowledge Distillation and flatter minima in continual learning setups).
    Neural Network Middle-Term Probabilistic Forecasting of Daily Power Consumption. (arXiv:2006.16388v2 [stat.ME] UPDATED)
    (2 min) Middle-term horizon (months to a year) power consumption prediction is a main challenge in the energy sector, in particular when probabilistic forecasting is considered. We propose a new modelling approach that incorporates trend, seasonality and weather conditions, as explicative variables in a shallow Neural Network with an autoregressive feature. We obtain excellent results for density forecast on the one-year test set applying it to the daily power consumption in New England U.S.A.. The quality of the achieved power consumption probabilistic forecasting has been verified, on the one hand, comparing the results to other standard models for density forecasting and, on the other hand, considering measures that are frequently used in the energy sector as pinball loss and CI backtesting.
    Identify comorbidities associated with recurrent ED and in-patient visits. (arXiv:2110.13769v2 [stat.ML] UPDATED)
    (2 min) In the hospital setting, a small percentage of recurrent frequent patients contribute to a disproportional amount of healthcare resource usage. Moreover, in many of these cases, patient outcomes can be greatly improved by reducing reoccurring visits, especially when they are associated with substance abuse, mental health, and medical factors that could be improved by social-behavioral interventions, outpatient or preventative care. Additionally, health care costs can be reduced significantly with fewer preventable recurrent visits. To address this, we developed a computationally efficient and interpretable framework that both identifies recurrent patients with high utilization and determines which comorbidities contribute most to their recurrent visits. Specifically, we present a novel algorithm, called the minimum similarity association rules (MSAR), balancing confidence-support trade-off, to determine the conditions most associated with reoccurring Emergency department (ED) and inpatient visits. We validate MSAR on a large Electric Health Record (EHR) dataset.
    Linear Classifiers in Product Space Forms. (arXiv:2102.10204v3 [cs.LG] UPDATED)
    (2 min) Embedding methods for product spaces are powerful techniques for low-distortion and low-dimensional representation of complex data structures. Here, we address the new problem of linear classification in product space forms -- products of Euclidean, spherical, and hyperbolic spaces. First, we describe novel formulations for linear classifiers on a Riemannian manifold using geodesics and Riemannian metrics which generalize straight lines and inner products in vector spaces. Second, we prove that linear classifiers in $d$-dimensional space forms of any curvature have the same expressive power, i.e., they can shatter exactly $d+1$ points. Third, we formalize linear classifiers in product space forms, describe the first known perceptron and support vector machine classifiers for such spaces and establish rigorous convergence results for perceptrons. Moreover, we prove that the Vapnik-Chervonenkis dimension of linear classifiers in a product space form of dimension $d$ is \emph{at least} $d+1$. We support our theoretical findings with simulations on several datasets, including synthetic data, image data, and single-cell RNA sequencing (scRNA-seq) data. The results show that classification in low-dimensional product space forms for scRNA-seq data offers, on average, a performance improvement of $\sim15\%$ when compared to that in Euclidean spaces of the same dimension.
    Provably Efficient Reinforcement Learning with Linear Function Approximation Under Adaptivity Constraints. (arXiv:2101.02195v2 [cs.LG] UPDATED)
    (2 min) We study reinforcement learning (RL) with linear function approximation under the adaptivity constraint. We consider two popular limited adaptivity models: the batch learning model and the rare policy switch model, and propose two efficient online RL algorithms for episodic linear Markov decision processes, where the transition probability and the reward function can be represented as a linear function of some known feature mapping. In specific, for the batch learning model, our proposed LSVI-UCB-Batch algorithm achieves an $\tilde O(\sqrt{d^3H^3T} + dHT/B)$ regret, where $d$ is the dimension of the feature mapping, $H$ is the episode length, $T$ is the number of interactions and $B$ is the number of batches. Our result suggests that it suffices to use only $\sqrt{T/dH}$ batches to obtain $\tilde O(\sqrt{d^3H^3T})$ regret. For the rare policy switch model, our proposed LSVI-UCB-RareSwitch algorithm enjoys an $\tilde O(\sqrt{d^3H^3T[1+T/(dH)]^{dH/B}})$ regret, which implies that $dH\log T$ policy switches suffice to obtain the $\tilde O(\sqrt{d^3H^3T})$ regret. Our algorithms achieve the same regret as the LSVI-UCB algorithm (Jin et al., 2019), yet with a substantially smaller amount of adaptivity. We also establish a lower bound for the batch learning model, which suggests that the dependency on $B$ in our regret bound is tight.
    Boosting the Certified Robustness of L-infinity Distance Nets. (arXiv:2110.06850v3 [cs.LG] UPDATED)
    (2 min) Recently, Zhang et al.(2021) developed a new neural network architecture based on $\ell_\infty$-distance functions, which naturally possesses certified $\ell_\infty$ robustness by its construction. Despite rigorous theoretical guarantees, the model so far can only achieve comparable performance to conventional networks. In this paper, we make the following two contributions: $\mathrm{(i)}$ We demonstrate that $\ell_\infty$-distance nets enjoy a fundamental advantage in certified robustness over conventional networks (under typical certification approaches); $\mathrm{(ii)}$ With an improved training process we are able to significantly boost the certified accuracy of $\ell_\infty$-distance nets. Our training approach largely alleviates the optimization problem that arose in the previous training scheme, in particular, the unexpected large Lipschitz constant due to the use of a crucial trick called $\ell_p$-relaxation. The core of our training approach is a novel objective function that combines scaled cross-entropy loss and clipped hinge loss with a decaying mixing coefficient. Experiments show that using the proposed training strategy, the certified accuracy of $\ell_\infty$-distance net can be dramatically improved from 33.30% to 40.06% on CIFAR-10 ($\epsilon=8/255$), meanwhile outperforming other approaches in this area by a large margin. Our results clearly demonstrate the effectiveness and potential of $\ell_\infty$-distance net for certified robustness. Codes are available at https://github.com/zbh2047/L_inf-dist-net-v2.
    A General Framework for Treatment Effect Estimation in Semi-Supervised and High Dimensional Settings. (arXiv:2201.00468v1 [stat.ME])
    (2 min) In this article, we aim to provide a general and complete understanding of semi-supervised (SS) causal inference for treatment effects. Specifically, we consider two such estimands: (a) the average treatment effect and (b) the quantile treatment effect, as prototype cases, in an SS setting, characterized by two available data sets: (i) a labeled data set of size $n$, providing observations for a response and a set of high dimensional covariates, as well as a binary treatment indicator; and (ii) an unlabeled data set of size $N$, much larger than $n$, but without the response observed. Using these two data sets, we develop a family of SS estimators which are ensured to be: (1) more robust and (2) more efficient than their supervised counterparts based on the labeled data set only. Beyond the 'standard' double robustness results (in terms of consistency) that can be achieved by supervised methods as well, we further establish root-n consistency and asymptotic normality of our SS estimators whenever the propensity score in the model is correctly specified, without requiring specific forms of the nuisance functions involved. Such an improvement of robustness arises from the use of the massive unlabeled data, so it is generally not attainable in a purely supervised setting. In addition, our estimators are shown to be semi-parametrically efficient as long as all the nuisance functions are correctly specified. Moreover, as an illustration of the nuisance estimators, we consider inverse-probability-weighting type kernel smoothing estimators involving unknown covariate transformation mechanisms, and establish in high dimensional scenarios novel results on their uniform convergence rates, which should be of independent interest. Numerical results on both simulated and real data validate the advantage of our methods over their supervised counterparts with respect to both robustness and efficiency.
    Generalization Bounds for Noisy Iterative Algorithms Using Properties of Additive Noise Channels. (arXiv:2102.02976v3 [stat.ML] UPDATED)
    (2 min) Machine learning models trained by different optimization algorithms under different data distributions can exhibit distinct generalization behaviors. In this paper, we analyze the generalization of models trained by noisy iterative algorithms. We derive distribution-dependent generalization bounds by connecting noisy iterative algorithms to additive noise channels found in communication and information theory. Our generalization bounds shed light on several applications, including differentially private stochastic gradient descent (DP-SGD), federated learning, and stochastic gradient Langevin dynamics (SGLD). We demonstrate our bounds through numerical experiments, showing that they can help understand recent empirical observations of the generalization phenomena of neural networks.
    Minimum Excess Risk in Bayesian Learning. (arXiv:2012.14868v2 [cs.LG] UPDATED)
    (2 min) We analyze the best achievable performance of Bayesian learning under generative models by defining and upper-bounding the minimum excess risk (MER): the gap between the minimum expected loss attainable by learning from data and the minimum expected loss that could be achieved if the model realization were known. The definition of MER provides a principled way to define different notions of uncertainties in Bayesian learning, including the aleatoric uncertainty and the minimum epistemic uncertainty. Two methods for deriving upper bounds for the MER are presented. The first method, generally suitable for Bayesian learning with a parametric generative model, upper-bounds the MER by the conditional mutual information between the model parameters and the quantity being predicted given the observed data. It allows us to quantify the rate at which the MER decays to zero as more data becomes available. Under realizable models, this method also relates the MER to the richness of the generative function class, notably the VC dimension in binary classification. The second method, particularly suitable for Bayesian learning with a parametric predictive model, relates the MER to the minimum estimation error of the model parameters from data. It explicitly shows how the uncertainty in model parameter estimation translates to the MER and to the final prediction uncertainty. We also extend the definition and analysis of MER to the setting with multiple model families and the setting with nonparametric models. Along the discussions we draw some comparisons between the MER in Bayesian learning and the excess risk in frequentist learning.
    Nearly Minimax Optimal Reinforcement Learning for Discounted MDPs. (arXiv:2010.00587v3 [cs.LG] UPDATED)
    (2 min) We study the reinforcement learning problem for discounted Markov Decision Processes (MDPs) under the tabular setting. We propose a model-based algorithm named UCBVI-$\gamma$, which is based on the \emph{optimism in the face of uncertainty principle} and the Bernstein-type bonus. We show that UCBVI-$\gamma$ achieves an $\tilde{O}\big({\sqrt{SAT}}/{(1-\gamma)^{1.5}}\big)$ regret, where $S$ is the number of states, $A$ is the number of actions, $\gamma$ is the discount factor and $T$ is the number of steps. In addition, we construct a class of hard MDPs and show that for any algorithm, the expected regret is at least $\tilde{\Omega}\big({\sqrt{SAT}}/{(1-\gamma)^{1.5}}\big)$. Our upper bound matches the minimax lower bound up to logarithmic factors, which suggests that UCBVI-$\gamma$ is nearly minimax optimal for discounted MDPs.
    The Flip Side of the Reweighted Coin: Duality of Adaptive Dropout and Regularization. (arXiv:2106.07769v3 [cs.LG] UPDATED)
    (2 min) Among the most successful methods for sparsifying deep (neural) networks are those that adaptively mask the network weights throughout training. By examining this masking, or dropout, in the linear case, we uncover a duality between such adaptive methods and regularization through the so-called "$\eta$-trick" that casts both as iteratively reweighted optimizations. We show that any dropout strategy that adapts to the weights in a monotonic way corresponds to an effective subquadratic regularization penalty, and therefore leads to sparse solutions. We obtain the effective penalties for several popular sparsification strategies, which are remarkably similar to classical penalties commonly used in sparse optimization. Considering variational dropout as a case study, we demonstrate similar empirical behavior between the adaptive dropout method and classical methods on the task of deep network sparsification, validating our theory.
    Hands-on Bayesian Neural Networks -- a Tutorial for Deep Learning Users. (arXiv:2007.06823v3 [cs.LG] UPDATED)
    (2 min) Modern deep learning methods constitute incredibly powerful tools to tackle a myriad of challenging problems. However, since deep learning methods operate as black boxes, the uncertainty associated with their predictions is often challenging to quantify. Bayesian statistics offer a formalism to understand and quantify the uncertainty associated with deep neural network predictions. This tutorial provides an overview of the relevant literature and a complete toolset to design, implement, train, use and evaluate Bayesian Neural Networks, i.e. Stochastic Artificial Neural Networks trained using Bayesian methods.
    Recover the spectrum of covariance matrix: a non-asymptotic iterative method. (arXiv:2201.00230v1 [stat.ML])
    (2 min) It is well known the sample covariance has a consistent bias in the spectrum, for example spectrum of Wishart matrix follows the Marchenko-Pastur law. We in this work introduce an iterative algorithm 'Concent' that actively eliminate this bias and recover the true spectrum for small and moderate dimensions.
    Statistical and Topological Properties of Sliced Probability Divergences. (arXiv:2003.05783v2 [stat.ML] UPDATED)
    (2 min) The idea of slicing divergences has been proven to be successful when comparing two probability measures in various machine learning applications including generative modeling, and consists in computing the expected value of a `base divergence' between one-dimensional random projections of the two measures. However, the topological, statistical, and computational consequences of this technique have not yet been well-established. In this paper, we aim at bridging this gap and derive various theoretical properties of sliced probability divergences. First, we show that slicing preserves the metric axioms and the weak continuity of the divergence, implying that the sliced divergence will share similar topological properties. We then precise the results in the case where the base divergence belongs to the class of integral probability metrics. On the other hand, we establish that, under mild conditions, the sample complexity of a sliced divergence does not depend on the problem dimension. We finally apply our general results to several base divergences, and illustrate our theory on both synthetic and real data experiments.
    Fair Data Representation for Machine Learning at the Pareto Frontier. (arXiv:2201.00292v1 [stat.ML])
    (2 min) As machine learning powered decision making is playing an increasingly important role in our daily lives, it is imperative to strive for fairness of the underlying data processing and algorithms. We propose a pre-processing algorithm for fair data representation via which L2- objective supervised learning algorithms result in an estimation of the Pareto frontier between prediction error and statistical disparity. In particular, the present work applies the optimal positive definite affine transport maps to approach the post-processing Wasserstein barycenter characterization of the optimal fair L2-objective supervised learning via a pre-processing data deformation. We call the resulting data Wasserstein pseudo-barycenter. Furthermore, we show that the Wasserstein geodesics from the learning outcome marginals to the barycenter characterizes the Pareto frontier between L2-loss and total Wasserstein distance among learning outcome marginals. Thereby, an application of McCann interpolation generalizes the pseudo-barycenter to a family of data representations via which L2-objective supervised learning algorithms result in the Pareto frontier. Numerical simulations underscore the advantages of the proposed data representation: (1) the pre-processing step is compositive with arbitrary L2-objective supervised learning methods and unseen data; (2) the fair representation protects data privacy by preventing the training machine from direct or indirect access to the sensitive information of the data; (3) the optimal affine map results in efficient computation of fair supervised learning on high-dimensional data; (4) experimental results shed light on the fairness of L2-objective unsupervised learning via the proposed fair data representation.
    Latent structure blockmodels for Bayesian spectral graph clustering. (arXiv:2107.01734v2 [stat.ML] UPDATED)
    (2 min) Spectral embedding of network adjacency matrices often produces node representations living approximately around low-dimensional submanifold structures. In particular, hidden substructure is expected to arise when the graph is generated from a latent position model. Furthermore, the presence of communities within the network might generate community-specific submanifold structures in the embedding, but this is not explicitly accounted for in most statistical models for networks. In this article, a class of models called latent structure block models (LSBM) is proposed to address such scenarios, allowing for graph clustering when community-specific one dimensional manifold structure is present. LSBMs focus on a specific class of latent space model, the random dot product graph (RDPG), and assign a latent submanifold to the latent positions of each community. A Bayesian model for the embeddings arising from LSBMs is discussed, and shown to have a good performance on simulated and real world network data. The model is able to correctly recover the underlying communities living in a one-dimensional manifold, even when the parametric form of the underlying curves is unknown, achieving remarkable results on a variety of real data.
    Parallel sequential Monte Carlo for stochastic gradient-free nonconvex optimization. (arXiv:1811.09469v4 [stat.CO] UPDATED)
    (2 min) We introduce and analyze a parallel sequential Monte Carlo methodology for the numerical solution of optimization problems that involve the minimization of a cost function that consists of the sum of many individual components. The proposed scheme is a stochastic zeroth order optimization algorithm which demands only the capability to evaluate small subsets of components of the cost function. It can be depicted as a bank of samplers that generate particle approximations of several sequences of probability measures. These measures are constructed in such a way that they have associated probability density functions whose global maxima coincide with the global minima of the original cost function. The algorithm selects the best performing sampler and uses it to approximate a global minimum of the cost function. We prove analytically that the resulting estimator converges to a global minimum of the cost function almost surely and provide explicit convergence rates in terms of the number of generated Monte Carlo samples and the dimension of the search space. We show, by way of numerical examples, that the algorithm can tackle cost functions with multiple minima or with broad "flat" regions which are hard to minimize using gradient-based techniques.
    Neural network training under semidefinite constraints. (arXiv:2201.00632v1 [cs.LG])
    (2 min) This paper is concerned with the training of neural networks (NNs) under semidefinite constraints. This type of training problems has recently gained popularity since semidefinite constraints can be used to verify interesting properties for NNs that include, e.g., the estimation of an upper bound on the Lipschitz constant, which relates to the robustness of an NN, or the stability of dynamic systems with NN controllers. The utilized semidefinite constraints are based on sector constraints satisfied by the underlying activation functions. Unfortunately, one of the biggest bottlenecks of these new results is the required computational effort for incorporating the semidefinite constraints into the training of NNs which is limiting their scalability to large NNs. We address this challenge by developing interior point methods for NN training that we implement using barrier functions for semidefinite constraints. In order to efficiently compute the gradients of the barrier terms, we exploit the structure of the semidefinite constraints. In experiments, we demonstrate the superior efficiency of our training method over previous approaches, which allows us, e.g., to use semidefinite constraints in the training of Wasserstein generative adversarial networks, where the discriminator must satisfy a Lipschitz condition.
    Deep Feature Fusion for Mitosis Counting. (arXiv:2002.03781v2 [cs.CV] UPDATED)
    (2 min) Each woman living in the United States has about 1 in 8 chance of developing invasive breast cancer. The mitotic cell count is one of the most common tests to assess the aggressiveness or grade of breast cancer. In this prognosis, histopathology images must be examined by a pathologist using high-resolution microscopes to count the cells. Unfortunately, can be an exhaustive task with poor reproducibility, especially for non-experts. Deep learning networks have recently been adapted to medical applications which are able to automatically localize these regions of interest. However, these region-based networks lack the ability to take advantage of the segmentation features produced by a full image CNN which are often used as a sole method of detection. Therefore, the proposed method leverages Faster RCNN for object detection while fusing segmentation features generated by a UNet with RGB image features to achieve an F-score of 0.508 on the MITOS-ATYPIA 2014 mitosis counting challenge dataset, outperforming state-of-the-art methods.
    Improving Actor-Critic Reinforcement Learning via Hamiltonian Monte Carlo Method. (arXiv:2103.12020v3 [cs.LG] UPDATED)
    (2 min) The actor-critic RL is widely used in various robotic control tasks. By viewing the actor-critic RL from the perspective of variational inference (VI), the policy network is trained to obtain the approximate posterior of actions given the optimality criteria. However, in practice, the actor-critic RL may yield suboptimal policy estimates due to the amortization gap and insufficient exploration. In this work, inspired by the previous use of Hamiltonian Monte Carlo (HMC) in VI, we propose to integrate the policy network of actor-critic RL with HMC, which is termed as {\it Hamiltonian Policy}. As such we propose to evolve actions from the base policy according to HMC, and our proposed method has many benefits. First, HMC can improve the policy distribution to better approximate the posterior and hence reduce the amortization gap. Second, HMC can also guide the exploration more to the regions of action spaces with higher Q values, enhancing the exploration efficiency. Further, instead of directly applying HMC into RL, we propose a new leapfrog operator to simulate the Hamiltonian dynamics. Finally, in safe RL problems, we find that the proposed method can not only improve the achieved return, but also reduce safety constraint violations by discarding potentially unsafe actions. With comprehensive empirical experiments on continuous control baselines, including MuJoCo and PyBullet Roboschool, we show that the proposed approach is a data-efficient and easy-to-implement improvement over previous actor-critic methods.
    Latent Gaussian Model Boosting. (arXiv:2105.08966v3 [cs.LG] UPDATED)
    (2 min) Latent Gaussian models and boosting are widely used techniques in statistics and machine learning. Tree-boosting shows excellent prediction accuracy on many data sets, but potential drawbacks are that it assumes conditional independence of samples, produces discontinuous predictions for, e.g., spatial data, and it can have difficulty with high-cardinality categorical variables. Latent Gaussian models, such as Gaussian process and grouped random effects models, are flexible prior models which explicitly model dependence among samples and which allow for efficient learning of predictor functions and for making probabilistic predictions. However, existing latent Gaussian models usually assume either a zero or a linear prior mean function which can be an unrealistic assumption. This article introduces a novel approach that combines boosting and latent Gaussian models to remedy the above-mentioned drawbacks and to leverage the advantages of both techniques. We obtain increased prediction accuracy compared to existing approaches in both simulated and real-world data experiments.
    Thinking inside the box: A tutorial on grey-box Bayesian optimization. (arXiv:2201.00272v1 [cs.LG])
    (2 min) Bayesian optimization (BO) is a framework for global optimization of expensive-to-evaluate objective functions. Classical BO methods assume that the objective function is a black box. However, internal information about objective function computation is often available. For example, when optimizing a manufacturing line's throughput with simulation, we observe the number of parts waiting at each workstation, in addition to the overall throughput. Recent BO methods leverage such internal information to dramatically improve performance. We call these "grey-box" BO methods because they treat objective computation as partially observable and even modifiable, blending the black-box approach with so-called "white-box" first-principles knowledge of objective function computation. This tutorial describes these methods, focusing on BO of composite objective functions, where one can observe and selectively evaluate individual constituents that feed into the overall objective; and multi-fidelity BO, where one can evaluate cheaper approximations of the objective function by varying parameters of the evaluation oracle.
    Deep-learning-based upscaling method for geologic models via theory-guided convolutional neural network. (arXiv:2201.00698v1 [cs.LG])
    (2 min) Large-scale or high-resolution geologic models usually comprise a huge number of grid blocks, which can be computationally demanding and time-consuming to solve with numerical simulators. Therefore, it is advantageous to upscale geologic models (e.g., hydraulic conductivity) from fine-scale (high-resolution grids) to coarse-scale systems. Numerical upscaling methods have been proven to be effective and robust for coarsening geologic models, but their efficiency remains to be improved. In this work, a deep-learning-based method is proposed to upscale the fine-scale geologic models, which can assist to improve upscaling efficiency significantly. In the deep learning method, a deep convolutional neural network (CNN) is trained to approximate the relationship between the coarse grid of hydraulic conductivity fields and the hydraulic heads, which can then be utilized to replace the numerical solvers while solving the flow equations for each coarse block. In addition, physical laws (e.g., governing equations and periodic boundary conditions) can also be incorporated into the training process of the deep CNN model, which is termed the theory-guided convolutional neural network (TgCNN). With the physical information considered, dependence on the data volume of training the deep learning models can be reduced greatly. Several subsurface flow cases are introduced to test the performance of the proposed deep-learning-based upscaling method, including 2D and 3D cases, and isotropic and anisotropic cases. The results show that the deep learning method can provide equivalent upscaling accuracy to the numerical method, and efficiency can be improved significantly compared to numerical upscaling.
    Operator Deep Q-Learning: Zero-Shot Reward Transferring in Reinforcement Learning. (arXiv:2201.00236v1 [cs.LG])
    (2 min) Reinforcement learning (RL) has drawn increasing interests in recent years due to its tremendous success in various applications. However, standard RL algorithms can only be applied for single reward function, and cannot adapt to an unseen reward function quickly. In this paper, we advocate a general operator view of reinforcement learning, which enables us to directly approximate the operator that maps from reward function to value function. The benefit of learning the operator is that we can incorporate any new reward function as input and attain its corresponding value function in a zero-shot manner. To approximate this special type of operator, we design a number of novel operator neural network architectures based on its theoretical properties. Our design of operator networks outperform the existing methods and the standard design of general purpose operator network, and we demonstrate the benefit of our operator deep Q-learning framework in several tasks including reward transferring for offline policy evaluation (OPE) and reward transferring for offline policy optimization in a range of tasks.
    Modeling Advection on Directed Graphs using Mat\'ern Gaussian Processes for Traffic Flow. (arXiv:2201.00001v1 [math.NA])
    (2 min) The transport of traffic flow can be modeled by the advection equation. Finite difference and finite volumes methods have been used to numerically solve this hyperbolic equation on a mesh. Advection has also been modeled discretely on directed graphs using the graph advection operator [4, 18]. In this paper, we first show that we can reformulate this graph advection operator as a finite difference scheme. We then propose the Directed Graph Advection Mat\'ern Gaussian Process (DGAMGP) model that incorporates the dynamics of this graph advection operator into the kernel of a trainable Mat\'ern Gaussian Process to effectively model traffic flow and its uncertainty as an advective process on a directed graph.
    Optimal Representations for Covariate Shift. (arXiv:2201.00057v1 [cs.LG])
    (2 min) Machine learning systems often experience a distribution shift between training and testing. In this paper, we introduce a simple variational objective whose optima are exactly the set of all representations on which risk minimizers are guaranteed to be robust to any distribution shift that preserves the Bayes predictor, e.g., covariate shifts. Our objective has two components. First, a representation must remain discriminative for the task, i.e., some predictor must be able to simultaneously minimize the source and target risk. Second, the representation's marginal support needs to be the same across source and target. We make this practical by designing self-supervised learning methods that only use unlabelled data and augmentations to train robust representations. Our objectives achieve state-of-the-art results on DomainBed, and give insights into the robustness of recent methods, such as CLIP.
  • Data Science Central

    Beginning Transition to Word Press
    (2 min) Data Science Central is transitioning over to a new platform on January 10, 2022.  In order to complete this migration, we will not be accepting any new submissions for publication after today, January 4th, 2022 at 4:00PM EST. Regular writers will be notified by email about logging into the…
    Using DSC News Feeds
    (2 min) Newsfeeds are some of the more useful features of many websites, albeit frequently underappreciated by those who see feeds as just an odd-ball bit of text. A typical newsfeed contains a wealth of information about a site and the articles on it. The post Using DSC News Feeds appeared first on Data Science Central.
    Intelligent Gateways Applications For Greenfield and Brownfield Environments
    (2 min) The Internet of Things has led to the rise of a new era of computing where intelligent applications are continuously monitored. Also, connected devices need to be efficiently controlled, thereby continuously transforming information into knowledge. By intelligently gathering and analyzing huge amounts of data, smart systems can constantly optimize business flows and automate industrial control… Read More »Intelligent Gateways Applications For Greenfield and Brownfield Environments The post Intelligent Gateways Applications For Greenfield and Brownfield Environments appeared first on Data Science Central.
    Intelligent Gateways Applications For Greenfield and Brownfield Environments
    (3 min) The Internet of Things has led to the rise of a new era of computing where intelligent applications are continuously monitored. Also, connected devices need to be efficiently controlled, thereby continuously transforming information into knowledge. By intelligently gathering and analyzing huge amounts of data, smart systems can…
    Can OTT platforms succeed with machine learning services? An insight
    (5 min) Whether you're acquainted with devices such as the Amazon Fire Stick, Chromecast, Spotify, Youtube, or SlingTV, you presumably already have a general understanding of what over-the-top (OTT) television is. Numerous…
    Trends Towards 2022
    (25 min) It's the last week of the year. The gifts have been opened (well, the ones that aren't currently still sitting in a dock in Los Angeles after being ordered in November), the cats have been teaching tree ornaments the meaning of the word gravity, and the cookies which tasted so good on Christmas Eve are getting more than a bit stale. In short, it's…
  • The Official NVIDIA Blog

    Teamwork Makes AVs Work: NVIDIA and Deloitte Deliver Turnkey Solutions for AV Developers
    (4 min) Autonomous vehicles are born in the data center, which is why NVIDIA and Deloitte are delivering a strong foundation for developers to deploy robust self-driving technology. At CES this week, the companies detailed their collaboration, which is aimed at easing the biggest pain points in AV development. Deloitte, a leading global consulting firm, is pairing Read article > The post Teamwork Makes AVs Work: NVIDIA and Deloitte Deliver Turnkey Solutions for AV Developers appeared first on The Official NVIDIA Blog.
    ‘AI Dungeon’ Creator Nick Walton Uses AI to Generate Infinite Gaming Storylines
    (3 min) What started as Nick Walton’s college hackathon project grew into AI Dungeon, a popular text adventure game with over 1.5 million users. Walton is the co-founder and CEO of Latitude, a Utah-based startup that uses AI to create unique gaming storylines. He spoke with NVIDIA AI Podcast host Noah Kravitz about how natural language processing Read article > The post ‘AI Dungeon’ Creator Nick Walton Uses AI to Generate Infinite Gaming Storylines appeared first on The Official NVIDIA Blog.
  • AWS Machine Learning Blog

    How Wix empowers customer care with AI capabilities using Amazon Transcribe
    (7 min) With over 200 million users worldwide, Wix is a leading cloud-based development platform for building fully personalized, high-quality websites. Wix makes it easy for anyone to create a beautiful and professional web presence. When Wix started, it was easy to understand user sentiment and identify product improvement opportunities because the user base was small. Such […]
    How to approach conversation design with Amazon Lex: Building and testing (Part 3)
    (11 min) In parts one and two of our guide to conversation design with Amazon Lex, we discussed how to gather requirements for your conversational AI application and draft conversational flows. In this post, we help you bring all the pieces together. You’ll learn how draft an interaction model to deliver natural conversational experiences, and how to […]
    Deploying ML models using SageMaker Serverless Inference (Preview)
    (7 min) Amazon SageMaker Serverless Inference (Preview) was recently announced at re:Invent 2021 as a new model hosting feature that lets customers serve model predictions without having to explicitly provision compute instances or configure scaling policies to handle traffic variations. Serverless Inference is a new deployment capability that complements SageMaker’s existing options for deployment that include: SageMaker […]
    Build and visualize a real-time fraud prevention system using Amazon Fraud Detector
    (12 min) We’re living in a world of everything-as-an-online-service. Service providers from almost every industry are in the race to feature the best user experience for their online channels like web portals and mobile applications. This raises a new challenge. How do we stop illegal and fraudulent behaviors without impacting typical legitimate interactions? This challenge is even […]
  • John D. Cook

    One diagram, two completely different meanings
    (1 min) I was thumbing through a new book on causal inference, The Effect by Nick Huntington-Klein, and the following diagram caught my eye. Then it made my head hurt. It looks like a category theory diagram. What’s that doing in a book on causal inference? And if it is a category theory diagram, something’s wrong. Either […] One diagram, two completely different meanings first appeared on John D. Cook.
    Memorable techniques
    (2 min) My favorite part of this post by Ian Miellis the introduction. The article is about shell commands, but the introduction brings up a more general point. … there are thousands of reusable patterns I’ve picked up … Unfortunately, I’ve forgotten about 95% of them. … The point is to reflect on what actually stuck, so […] Memorable techniques first appeared on John D. Cook.
  • Artificial Intelligence

    Log Object
    (1 min) Does anybody know what a log object is? submitted by /u/Theverybest196 [link] [comments]
    Modzy launched on Product Hunt!
    (0 min) The Modzy ModelOps platform accelerates the deployment, integration, and governance of production-ready AI. Supported by a growing community of data scientists and developers, Modzy solves the toughest problem with using AI at scale. Check out our post linked here, where you can sign up to test out our free version! submitted by /u/modzykirsten [link] [comments]
    [R] Yale & IBM Propose KerGNNs: Interpretable GNNs with Graph Kernels That Achieve SOTA-Competitive Performance
    (1 min) A research team from Yale and IBM presents Kernel Graph Neural Networks (KerGNNs), which integrate graph kernels into the message passing process of GNNs in one framework, achieving performance comparable to state-of-the-art methods and significantly improving model interpretability compared with conventional GNNs. Here is a quick read: Yale & IBM Propose KerGNNs: Interpretable GNNs with Graph Kernels That Achieve SOTA-Competitive Performance. The paper KerGNNs: Interpretable Graph Neural Networks with Graph Kernels is on arXiv. submitted by /u/Yuqing7 [link] [comments]
    How do you measure fairness without access to demographic data?
    (2 min) Hi all! I'm working on a paper about measuring algorithmic fairness in cases where you don't have direct access to demographic data (for example, if you want to see whether a lender is discriminating against a particular race but the lender is not collecting/releasing race data of loan applicants). If you have ~10 minutes and work in the ethical AI space, it would be a great help to hear from this community on whether/how often you have faced this issue in practice and what you think should be done to mitigate. Survey link is here: https://cambridge.eu.qualtrics.com/jfe/form/SV_e9czBBKDitlglaC submitted by /u/emmharv [link] [comments]
    Use this new year to start learning something new! Whether it is machine learning or piano, just give it a try for 5 minutes tonight! If it is ML-related, this video might help out, and you can start there!
    (1 min) submitted by /u/OnlyProggingForFun [link] [comments]
    Changing Gender and Race in Image Search Results With Machine Learning
    (0 min) submitted by /u/DaveBowman1975 [link] [comments]
    Phonesites: Launch Pages in Minutes from Your Phone
    (0 min) submitted by /u/belqassim [link] [comments]
    Would it be possible to create an AI which can recognize your voice in the middle of a crowd or a somewhat noisy environment?
    (1 min) I know some software can already recognize specific voices, like Siri. However, Siri in particular can’t recognize it beyond the “Hey Siri” keywords I believe. Is it possible to create an AI which can continuously recognize x-person’s voice in a group of people or a loud environment (loud restaurant, construction nearby, concert, etc…) in real time? [link] [comments]
    Dumb-yet-sentient AI
    (3 min) It seems like most people associate sentient AI with Artificial General Intelligence, or AGI, but why should they be related? Wikipedia defines sentience as the "ability to be aware of feelings and sensations", and this doesn't seem to require any specific level of intelligence, does it? Couldn't we build a sentient AI bot, for example, out of "dumb" neural networks (e.g. relatively small transformers and/or CNN/RNN models), so long as they connect together in a way that involves sensing a world around them (in whatever dimensions that world has, not necessarily the same as ours), using these sensations to construct some sort of internal "state", and identifying that state accurately enough to drive subsequent action--or interaction with the world? submitted by /u/mm_maybe [link] [comments]
  • Machine Learning

    [D] Legal use of functions in Pytorch or Tensorflow
    (1 min) Does anybody know, is it legal to use Dropout or BatchNorm from Pytorch and Tensorflow due to Google's patents of these two functions? Did some library avoided patent infringement in its implementation of those functions? submitted by /u/Nos_per [link] [comments]
    [D] Search engine for time series
    (1 min) Hi, Last year I developed a passenger flow forecasting model. Passenger flows are heavily influenced by lockdown measures and the weather, so I wanted to incorporate features relating to that into my model. Doing so, I encountered various frustrations: ​ Data providers generally focus on a specific domain (e.g. covid, weather, ecommerce). Forecasts however can be influenced by data from many domains. Finding these providers, signing up, and reading documentation is very time consuming. It takes a lot of code just to get the data you want. For example, to get covid data for my province, I first had to call a /list endpoint, then retrieve the province identifier, and then loop over the data endpoint because the maximum range was 2 months. I think the core problem is that API’s (as the …
    [D] (A paper suggests) Most Time Series Anomaly Detection Papers are Wrong
    (1 min) I just stumbled on this very nice paper [a], which will appear in AAAI-22. The title seems much too modest, they show that a random algorithm can achieve apparent SOTA results in this domain. This seems to be a stunning result, that casts doubt on the contribution of dozens of papers. For some reason, the area of Time Series Anomaly Detection seems to be the wild west of dubious papers and sloppy thinking. As an aside, there is a benchmark set of 250 datasets here [b] that can be evaluated in a way that is free of the flaw. (my post title reflects my understanding of the paper, the authors may have a different preferred claim). [a] Towards a Rigorous Evaluation of Time-series Anomaly Detection https://arxiv.org/pdf/2109.05257.pdf [b] www.cs.ucr.edu/\~eamonn/time\_series\_data\_2018/UCR\_TimeSeriesAnomalyDatasets2021.zip submitted by /u/eamonnkeogh [link] [comments]
    [D] Multi-agent Deep Reinforcement learning
    (1 min) Hello! I hope you´re doing well. I am working on a multi-agent system with MADDPG.At time t when an agent asks for a task, the other agents are busy (i.e., the busy agents are those that are still processing a task, they didn´t finish it yet).So with this configuration, in the learning phase, I don´t know how to mask the state of the busy agents when injecting the state and action pair to the critic network. Thank you. submitted by /u/GuavaAgreeable208 [link] [comments]
    [D] VQ-VAE: Are there heuristics for the number of embeddings and the embedding dimension?
    (1 min) Hi r/MachineLearning, does anyone with experience training VQ-VAEs know if there are good rules of thumb for the embedding size? E.g. given data of dimension N, use M embeddings of size P Thanks for any help! submitted by /u/Natural_Profession_8 [link] [comments]
    [D] Preparing for a Comprehensive Exam in ML
    (1 min) I am a PhD student based in Canada and have a comprehensive exam coming up in 4-6 months. This is an exam I have been nervous about since I began my PhD. I am fairly confident about the actual proposal and answering questions related to my field. What concerns me more is fundamental/background question as ML and statistics is so broad. Plus, I am a little on the older side and my memory is a little poor. Have any students here taken a comprehensive exam? If so, what was your experience and how did you prepare? Is reading/making notes from a textbook a good idea? Or is preparing a list of topics and reading extensively about them a better option? submitted by /u/ConfusedNoobie [link] [comments]
    [D] Features of features of samples to features of samples?
    (1 min) I have a mxn matrix, m observations and n variables. Along with that, I have another dataset which are the features of the variables, a nxp matrix. What could be a way to get an mxp matrix (without naive matrix multiplication)? I wish to relate the observations to the features of variable. submitted by /u/l34df4rm3r [link] [comments]
    [D] What is the format of full paper presentations in general ML conferences like IJCAI and AAAI?
    (1 min) Hi all, This is my first conference season, so I am curious about how do author of full papers (not extended abstracts or student-track, not invited speeches) present in conferences like ICML, AAAI and IJCAI? I mean, for instance: are presentations performed in a panel with 3-5 presenters using slides, or are they all presented as posters where authors stay available for a duration of time for the interested readers to show up and discuss? Or something else? submitted by /u/briannaszvenska [link] [comments]
    [D] Can you recommend funny papper like "Single Headed Attention RNN: Stop Thinking With Your Head"
    (3 min) I really enjoyed reading this for a change to the textbook papers. Abstract The leading approaches in language modeling are all obsessed with TV shows of my youth - namely Transformers and Sesame Street. Transformers this, Transformers that, and over here a bonfire worth of GPU-TPU-neuromorphic wafer scale sil- icon. We opt for the lazy path of old and proven techniques with a fancy crypto1 inspired acronym: the Single Headed Attention RNN (SHA-RNN). The author’s lone goal is to show that the entire field might have evolved a different direction if we had instead been obsessed with a slightly differ- ent acronym and slightly different result. We take a previously strong language model based only on boring LSTMs and get it to within a stone’s throw of a stone’s throw of state-of-the-art byte level language model results on enwik8. This work has undergone no intensive hyperparameter optimization and lived entirely on a commodity desktop machine that made the author’s small stu- dio apartment far too warm in the midst of a San Franciscan summer2 . The final results are achiev- able in plus or minus 24 hours on a single GPU as the author is impatient. The attention mechanism is also readily extended to large contexts with minimal computation. Take that Sesame Street. https://arxiv.org/abs/1911.11423 submitted by /u/Puzzled-Bite-8467 [link] [comments]
    [D] Normalizing flows for distributions with finit support
    (1 min) I need to learn a map from Gaussain distribution to Gamma distribution with some custom parameters. So for both distributions, I can sample and evaluate probability density. The first thing, that came to my mind is using normalizing flow. Most approaches include log target probability density evaluation in the loss function. Obviously, normalizing flow sometimes returns negative values, and this term equals to infinity. "Positivation" functions on top of the NF break bijection properties for some regions of space (if not theoretically, but numerically - defenetely). Does NF approach is inapplicable from the box for such a simple problem or I'm missing something? submitted by /u/likan_blk [link] [comments]
    [D] Methods to create monolingual language model from pretrained multilingual model
    (1 min) Apart from just fine-tuning the pretrained multilingual language model on the target language, is there anything more sophisticated that people are doing? submitted by /u/learning-machinist [link] [comments]
    [P] Blogs on fundamentals of Score-based and Diffusion Probabilistic Models
    (1 min) This two-part blog describes the theoretical fundamentals of Score based models, Diffusion Probabilstic models and their relationship. It is written to be a coherent documentation of the theoretical developements in this new class of generative model. Rigorous mathematical proofs are excluded in order to make it more readable. Sharing it in case anyone finds it useful. Part 1: Score-base models Part 2: Diffusion Probabilistic Models submitted by /u/dasayan05 [link] [comments]
    [D] Pretraining the discriminator of a Least Squares GAN
    (2 min) I am trying to train a GAN to generate human poses in 3D space using the Humans 3.6M dataset. The output of the GAN is thus the 3D coordinates of the human joints. I have been experimenting with vanilla GANs but the output is quite noisy. I am now looking into Least Squares GAN but was wondering if it is a good idea to pretrain the discriminator of a Least Squares GAN since LSGANs address the problem of vanishing gradients and loss saturation? submitted by /u/I_am_a_robot_ [link] [comments]
    [D] Minimum Corpus Size for Word Embedding Extraction
    (1 min) Dear all, I have a smallish ~100MB corpus (of historical text in a non-mainstream language), on which I want to apply word embedding. - Is that enough? Shall I only consider "frequent" words? How frequent? does it help if I do some preprocessing such as stemming etc...? - How do I choose the parameters, especially the embedding dimensionality? - Any libraries recommended? - Are there some language-agnostic, unsupervised ways to evaluate the embeddings? ​ Thanks submitted by /u/ihatebeinganonymous [link] [comments]
    [D] Ideal deep learning library
    (3 min) From researcher perspective: - What do you miss in libraries like PyTorch or TensorFlow? - What could be improved? Some possible examples: - The way how autodiff works - Debugging features - Working with axes, einops - Something that just feel awkward, inconvenient or incomplete I would very much appreciate it if you could share your thoughts on this. submitted by /u/u6785 [link] [comments]
    Predicting future labels where the future values of only some features are known (RNNs, time series)[D]
    (3 min) I've done a fair bit of work with RNNs and Time series data, but I now realize there may be a fundamental gap in my knowledge. I was reading through this and it raised some questions about the different kinds of time forecasting problems. So, this is my summary of how it all works. Please correct me if I’m wrong. Scenario 1: You want to predict the value of some stocks into the future. Let’s say you have k stocks and n days of data. You don’t have features/labels stocks, rather the input is each stocks’ current and previous values, and the output is each stocks’ future values. Your two main options are the ‘auto-regressive’ approach where you predict the values one step at a time and feed them back in, or ‘single shot’ approach where you predict all values a fixed amount of time step…
    [P] Classification with imbalanced datasets question
    (3 min) I've been working on a medical classification project with an imbalanced tabular dataset. I have 3 classes, and each class has 44, 16, and 14 rows of data respectively. When I train a random forest classifier, I see that my model is only predicting the dominant class for all test instances most of the time. How can I get around to this? Also, are there any recommendations you can give me for dealing with imbalanced datasets? Thank you submitted by /u/chazy07 [link] [comments]
    [P] I implemented Conformer: Convolution-augmented Transformer
    (1 min) I implemented Google AI's "Conformer: Convolution-augmented Transformer for Speech Recognition" paper, it achieves the best of both worlds by combining CNNs and transformers to model both local and global dependencies and improves the local inductive bias in Transformers. https://github.com/Rishit-dagli/Conformer submitted by /u/Rishit-dagli [link] [comments]
  • Reinforcement Learning

    Simulation environment to real life. Is the brain of RL still flexible enough to learn in real life env?
    (1 min) I am planning to train TD3/DDPG using a simulation environment, and then continue the learning on to a real-life environment. I hope to reduce the timestep required to converge in real-life environment as it is costly and time-consuming. I am new to RL and I am curious as to: 'Would the algorithm still be flexible enough to continue learning?' I am slightly afraid about how the algorithm is going to think that it finished learning during the simulation environment, but then when it comes to real-life environment, it would not be flexible enough to learn on top of what it has already learned. Is this a trivial concern and is something that I should just let the algorithm learn by itself submitted by /u/KoreaNuclear [link] [comments]
    "Finding General Equilibria in Many-Agent Economic Simulations Using Deep Reinforcement Learning", Curry et al 2022
    (1 min) submitted by /u/gwern [link] [comments]
    Final reward! Help!
    (1 min) Hellooo, I have a question and I´ll be glad if someone could help me. The question is : in episodic tasks, can we work with two rewards, one during the steps of an episode and the other as the final reward at the end of the episode? Thank you! submitted by /u/LeatherCredit7148 [link] [comments]
    Some questions on Deep Reinforcement Learning (DRL)
    (2 min) Hello, I hope you are doing well. I am working on DRL and there´re still some unclear points for me: - How we tune the hyperparameters of the network (is there a method to simplify the task) - How to know that we formulated the best state for the agent? - In the case of collaborative multi-agent system, when an agent selects a task and the other agents are busy, will the reward be 0 for the busy agents or will be the reward of the agent that selects the task? submitted by /u/GuavaAgreeable208 [link] [comments]
    Workshops on AI and RL by Shaastra, IIT Madras
    (1 min) Workshops from Shaastra, IIT Madras about AI and Reinforcement Learning Certificates and recordings will be provided on registering in Shaastra's Website submitted by /u/RowEmbarrassed4756 [link] [comments]
    Real life Reinforcement learning
    (1 min) submitted by /u/odinnotdoit [link] [comments]
    Equivalent framework as Gin for PyTorch
    (0 min) Hi, does any of you know is there is an equivalent of this framework (https://github.com/google/gin-config) for PyTorch? Thanks submitted by /u/No_Possibility_7588 [link] [comments]
    Scalar reward is not enough
    (1 min) Check out this paper which discusses the idea that a scalar reward is not enough to create agi. https://arxiv.org/abs/2112.15422 What are your thoughts on this? submitted by /u/Longjumping-Chart-34 [link] [comments]

2022-01-04

  • Data Science Central

    DSC Weekly Digest 28 December 2021: An Auld Lang Syne (and Cosyne Too)
    (7 min) The last week of the year has traditionally been about reflections and planning, taking stock of the old (or auld), and preparing for the changes of the new. I've added my own thoughts about what 2022 will…
    Why a Data Science Career Is Worth Pursuing
    (4 min) “There were five exabytes of information created between the dawn of civilization through 2003, but that much information is now created every two days.” ~Eric Schmidt (Executive Chairman at Google) The term data science was first…
    Top Social Media Content Moderation Trends that Will Reign Supreme in 2022
    (4 min) The technique of managing desired information from online platforms such as social media networking sites is known as content moderation. It's also known as social moderation, and it's used to control various forms of content that aren't appropriate for a general audience.…
    Could we Live in a Universe with Fewer than Three Dimensions?
    (6 min) No, I am not a flat-earther. Quite the contrary, the ideas suggested here assume advanced scientific knowledge, such as quantum physics. The famous Cambridge-based physicist Stephen Hawking suggested that we may live in a universe with 11 dimensions. What I discuss here is not conflicting with that statement, as we shall see.  It all started…
    Data science and analytics: the future implications
    (4 min) Data science is a domain that combines data-bound analytical techniques and scientific theory to generate insights for business stakeholders. Its shape, elements, and size allow organizations to optimize operations, identify new business opportunities, and reduce the functional performance of departments such as marketing and sales. Simply put,…
  • The Official NVIDIA Blog

    NVIDIA Builds Isaac AMR Platform to Aid $9 Trillion Logistics Industry
    (4 min) Manufacturing and fulfillment centers are profoundly complex. Whenever new earbuds or socks land at your doorstep in hours or a vehicle rolls off an assembly line, a maze of magic happens with AI-driven logistics. Massive facilities like these are constantly in flux. Robots travel miles of aisles to roll up millions of products to assist Read article > The post NVIDIA Builds Isaac AMR Platform to Aid $9 Trillion Logistics Industry appeared first on The Official NVIDIA Blog.
    Gamers, Creators, Drivers Feel GeForce RTX, NVIDIA AI Everywhere
    (6 min) Putting the power of graphics and AI at the fingertips of more users than ever, NVIDIA announced today new laptops and autonomous vehicles using GeForce RTX and NVIDIA AI platforms and expanded reach for GeForce NOW cloud gaming across Samsung TVs and the AT&T network. A virtual address prior to CES showed next-gen games, new Read article > The post Gamers, Creators, Drivers Feel GeForce RTX, NVIDIA AI Everywhere appeared first on The Official NVIDIA Blog.
    NVIDIA Canvas Updated With New AI Model Delivering 4x Resolution and More Materials
    (2 min) As art evolves and artists’ abilities grow, so must their creative tools. NVIDIA Canvas, the free beta app and part of the NVIDIA Studio suite of creative apps and tools, has brought the real-time painting tool GauGAN to anyone with an NVIDIA RTX GPU. Artists use advanced AI to quickly turn simple brushstrokes into realistic Read article > The post NVIDIA Canvas Updated With New AI Model Delivering 4x Resolution and More Materials appeared first on The Official NVIDIA Blog.
    Groundbreaking Updates to NVIDIA Studio Power the 3D Virtual Worlds of Tomorrow, Today
    (4 min) We’re at the dawn of the next digital frontier. Creativity is fueling new developments in design, innovation and virtual worlds. For the creators driving this future, we’ve built NVIDIA Studio, a fully accelerated platform with high-performance GPUs as the heartbeat for laptops and desktops. This hardware is paired with exclusive NVIDIA RTX-accelerated software optimizations in Read article > The post Groundbreaking Updates to NVIDIA Studio Power the 3D Virtual Worlds of Tomorrow, Today appeared first on The Official NVIDIA Blog.
    NVIDIA Makes Free Version of Omniverse Available to Millions of Individual Creators and Artists Worldwide
    (4 min) Designed to be the foundation that connects virtual worlds, NVIDIA Omniverse is now available to millions of individual NVIDIA Studio creators using GeForce RTX and NVIDIA RTX GPUs. In a special address at CES, NVIDIA also announced new platform developments for Omniverse Machinima and Omniverse Audio2Face, new platform features like Nucleus Cloud and 3D marketplaces, Read article > The post NVIDIA Makes Free Version of Omniverse Available to Millions of Individual Creators and Artists Worldwide appeared first on The Official NVIDIA Blog.
    Autonomous Era Arrives at CES 2022 With NVIDIA DRIVE Hyperion and Omniverse Avatar
    (4 min) CES has long been a showcase on what’s coming down the technology pipeline. This year, NVIDIA is showing the radical innovation happening now. During a special virtual address at the show, Ali Kani, vice president and general manager of Automotive at NVIDIA, detailed the capabilities of DRIVE Hyperion and the many ways the industry is Read article > The post Autonomous Era Arrives at CES 2022 With NVIDIA DRIVE Hyperion and Omniverse Avatar appeared first on The Official NVIDIA Blog.
    GeForce NOW Delivers Legendary GeForce Gaming With More Games on More Networks to More Devices
    (3 min) GeForce NOW is kicking off the new year by bringing more games, more devices and more networks to our cloud gaming ecosystem. The next pair of Electronic Arts games, Battlefield 4 and Battlefield V, is streaming on GeForce NOW. We’re also working closely with a few titans in their respective industries: AT&T and Samsung. AT&T Read article > The post GeForce NOW Delivers Legendary GeForce Gaming With More Games on More Networks to More Devices appeared first on The Official NVIDIA Blog.
  • MIT News - Artificial intelligence

    Meet the 2021-22 Accenture Fellows
    (5 min) The 2021-22 Accenture Fellows are bolstering research and igniting ideas to help transform global business.
  • AWS Machine Learning Blog

    Take advantage of advanced deployment strategies using Amazon SageMaker deployment guardrails
    (11 min) Deployment guardrails in Amazon SageMaker provide a new set of deployment capabilities allowing you to implement advanced deployment strategies that minimize risk when deploying new model versions on SageMaker hosting. Depending on your use case, you can use a variety of deployment strategies to release new model versions. Each of these strategies relies on a […]
    Train graph neural nets for millions of proteins on Amazon SageMaker and Amazon DocumentDB (with MongoDB compatibility)
    (11 min) There are over 180,000 unique proteins with 3D structures determined, with tens of thousands new structures resolved every year. This is only a small fraction of the 200 million known proteins with distinctive sequences. Recent deep learning algorithms such as AlphaFold can accurately predict 3D structures of proteins using their sequences, which help scale the […]
  • Artificial Intelligence

    The Stargate sequence from Stanley Kubrick's 2001: A Space Odyssey, remade using AI, in the style of visionary artist Alex Grey.
    (0 min) submitted by /u/glenniszen [link] [comments]
    Good Object-recognition pre-trained models
    (1 min) I used yolov5 for object recognition and it does a fair job, but it also sees my hand as a bird in a lot of situations and leaves a lot to be desired. I know if they're driving off of similar computer vision models, there's gotta be a better one out there for detecting objects. PS I've been using pytorch for this so if you find a good model please also drop a link on how to get it up and running if you have it. submitted by /u/acraber [link] [comments]
    I gave it the gameplay section of the Geometry Dash wikipedia article, this is what I got
    (1 min) Geometry Dash features an hourglass timer with three sections in the shape of circles and triangles. The timer starts in the upper left corner of the in-game map and ends when it falls into the lower right corner, allowing players to use up their time. When it falls to the bottom center of the map, a demon car appears and must be navigated. The demon is a curved ramp for the player's icon, and every time the player's icon touches a portion of the demon, its ramp is altered. Also, the demon's top speed increases by 7x every time the player touches the bottom center. If the player touches the edge of the demon, then the demon will fall, taking a 5x amount of coins from the player's meter. At the end of the level, the demon car drops into a pit.[5] submitted by /u/Alternative-Ad-3041 [link] [comments]
    EndlessVN open alpha in march 2022
    (0 min) submitted by /u/roblox22y [link] [comments]
    Is there a AI which is able to edit images to make them look drawn?
    (0 min) With small mistakes, and only some different colors? submitted by /u/xXLisa28Xx [link] [comments]
    BadA$$ Kittays - Generated with Cyborg Love
    (0 min) submitted by /u/NeurogenicArtist [link] [comments]
    [R] A Neural Network Solves, Grades & Generates University-Level Mathematics Problems by Program Synthesis
    (1 min) In the new paper A Neural Network Solves and Generates Mathematics Problems by Program Synthesis: Calculus, Differential Equations, Linear Algebra, and More, a research team from MIT, Columbia University, Harvard University and University of Waterloo proposes a neural network that can solve university-level mathematics problems via program synthesis. Here is a quick read: A Neural Network Solves, Grades & Generates University-Level Mathematics Problems by Program Synthesis. The paper A Neural Network Solves and Generates Mathematics Problems by Program Synthesis: Calculus, Differential Equations, Linear Algebra, and More is on arXiv. submitted by /u/Yuqing7 [link] [comments]
    Top AI Trends of 2022
    (0 min) submitted by /u/Beautiful-Credit-868 [link] [comments]
    I put the word 'death' in a text to image AI and this is what I got...
    (1 min) submitted by /u/smartpug967 [link] [comments]
    Researchers Propose A Novel Parameter Differentiation-Based Method That Can Automatically Determine Which Parameters Should Be Shared And Which Ones Should Be Language-Specific
    (1 min) In recent years, neural machine translation (NMT) has attracted a lot of attention and has had a lot of success. While traditional NMT is capable of translating a single language pair, training a separate model for each language pair is time-consuming, especially given the world’s thousands of languages. As a result, multilingual NMT is designed to handle many language pairs in a single model, lowering the cost of offline training and online deployment significantly. Furthermore, parameter sharing in multilingual neural machine translation promotes positive knowledge transfer between languages and is advantageous for low-resource translation. Despite the advantages of cooperative training with a completely shared model, the MNMT approach has a model capacity problem. The shared parameters are more likely to preserve broad knowledge while ignoring language-specific knowledge. To improve the model capacity, researchers use heuristic design to create extra language-specific components and build a Multilingual neural machine translation (MNMT) model with a mix of shared and language-specific characteristics, such as the language-specific attention, lightweight language adaptor, or language-specific routing layer. Continue Reading Paper: https://arxiv.org/pdf/2112.13619v1.pdf Github: https://github.com/voidmagic/parameter-differentiation submitted by /u/ai-lover [link] [comments]
  • Machine Learning

    [P] ray-skorch - distributed PyTorch on Ray with sklearn API
    (1 min) tl;dr: train PyTorch models on large tabular datasets with a scikit-learn (skorch) API Hi r/MachineLearning, I'm the principal author of ray-skorch, a library that lets you run distributed PyTorch training on large-scale datasets while providing a familiar, scikit-learn compatible skorch API, integrating well with the rest of the scikit-learn ecosystem. Under the hood, ray-skorch uses Ray Train for distributed PyTorch training and Ray Data for handling and shuffling large datasets. ray-skorch works only with tabular data. Currently, it can use numpy arrays, pandas dataframes and Ray Data Datasets. pip install ray-skorch You can switch your skorch code to ray-skorch just by changing a few lines: import numpy as np from sklearn.datasets import make_classification from torch import nn # pip install pytorch_tabnet from pytorch_tabnet.tab_network import TabNet from ray_skorch import RayTrainNeuralNet X, y = make_classification(1000, 20, n_informative=10, random_state=0) X = X.astype(np.float32) y = y.astype(np.int64) net = RayTrainNeuralNet( TabNet, num_workers=2, # the only new mandatory argument criterion=nn.CrossEntropyLoss, max_epochs=10, lr=0.1, # TabNet specific arguments module__input_dim=20, module__output_dim=2, # required for classification loss funcs iterator_train__unsqueeze_label_tensor=False, iterator_valid__unsqueeze_label_tensor=False, ) net.fit(X, y) # predict_proba returns a ray.data.Dataset y_proba = net.predict_proba(X).to_pandas() More examples, including ones on bigger datasets, can be found here - https://github.com/Yard1/ray-skorch/tree/main/examples The package is experimental, and I’d love to hear your feedback - both on the package itself and on the concept of distributed training on tabular data with simple, familiar APIs. Any comments, suggestions or bug reports are hugely appreciated! submitted by /u/Yard1PL [link] [comments]
    [D] Deep Learning is the future of gaming.
    (1 min) Hey everybody --- I know this isn't hard core AI research but I have been thinking a lot about deep learning and gaming recently and put together a little presentation on how I see things unfolding. Lots of cool research featured in the video. https://www.youtube.com/watch?v=JDL8rZzYVwQ I go over: Photorealistic neural rendering Deepfakes for gaming (https://www.youtube.com/watch?v=RR7u11ANDWE is a better example than the obama one I used) GAN theft auto and dreaming up game engines with neural networks Large language models for building realistic NPCs and storytelling Using OpenAI Codex to automatically program games. It's really clear that deep learning is the most important technology to impact gaming since the advent of 3D graphics. Would love to talk with anybody who is working on stuff in this space. submitted by /u/sabalaba [link] [comments]
    [D] Style Transfer with Noise Vector
    (1 min) Hi everyone, I'm looking for a model which can perform style transfer, but also takes an auxiliary noise vector similar to that for StyleGAN to generate many stylized images for a single input image. Is anyone aware of any model meeting these requirements? My best idea so far is to first embed the image into the StyleGAN latent space with this paper, and then add noise to that vector. submitted by /u/Sebass13 [link] [comments]
    [D] Interpolation, Extrapolation and Linearisation (Prof. Yann LeCun, Dr. Randall Balestriero)
    (1 min) Special machine learning street talk episode! Yann LeCun thinks that it's specious to say neural network models are interpolating because in high dimensions, everything is extrapolation. Recently Dr. Randall Balestriero, Dr. Jerome Pesente and prof. Yann LeCun released their paper learning in high dimensions always amounts to extrapolation. This discussion has completely changed how we think about neural networks and their behaviour. In the intro we talk about the spline theory of NNs, interpolation in NNs and the curse of dimensionality. YT: https://youtu.be/86ib0sfdFtw Pod: https://anchor.fm/machinelearningstreettalk/episodes/061-Interpolation--Extrapolation-and-Linearisation-Prof--Yann-LeCun--Dr--Randall-Balestriero-e1cgdr0 References: Learning in High Dimension Always Amounts to Extrapolation [Randall Balestriero, Jerome Pesenti, Yann LeCun] https://arxiv.org/abs/2110.09485 A Spline Theory of Deep Learning [Dr. Balestriero, baraniuk] https://proceedings.mlr.press/v80/balestriero18b.html Neural Decision Trees [Dr. Balestriero] https://arxiv.org/pdf/1702.07360.pdf Interpolation of Sparse High-Dimensional Data [Dr. Thomas Lux] https://tchlux.github.io/papers/tchlux-2020-NUMA.pdf submitted by /u/timscarfe [link] [comments]
    [D] Neural Networks using a generic GPU framework
    (2 min) I have a (personal) ML project that uses CNNs but I have two little problems: 1. not everyone has a NVidia GPU at home (myself included, sadly); 2. The CNN needs to be trained every time it is used (it's photo to photo style transfer). So, what would be a good framework to implement the CNN for training (targeting desktop only)? I thought about using OpenGL, but I don't know if using GLSL shaders would be a good fit for it. submitted by /u/crimsom_king [link] [comments]
    [D] What are interviews usually like for ML positions?
    (1 min) For context, I'm applying for PhD level positions. Should I expect technical interviews including coding challenges similar to SWE? Any advice on prepping? submitted by /u/_Dark_Forest [link] [comments]
    Deep Learning Interviews: Hundreds of fully solved job interview questions from a wide range of key topics in AI
    (0 min) submitted by /u/pit_station [link] [comments]
    [N] Launching DagsHub 2.0 – Git-integrated data labeling and smart ML discussions
    (2 min) TL;DR – DagsHub is integrated with Label Studio, and you can now open datasets from Git and DVC remotes, label them and commit labels back, without doing any DevOps. You can also comment on labels, bounding boxes, or any file. Check out the example project, or try out the tutorial. Comparing annotations Hi r/ML! I'm one of the creators of DagsHub (https://www.dagshub.com). We help ML practitioners create a central repository for their projects, where they can leverage open-source tools to version datasets and models, track experiments, and starting today – label data, and comment on anything. Like GitHub for machine learning (you probably heard that before, but we mean it). Our vision is that anyone could jump into an open-source data science project and contribute code, data, labeling,…
    [D] Why is VAE used instead of AutoEncoder in the World Models paper (https://arxiv.org/pdf/1803.10122.pdf)?
    (2 min) Hi All, I was just reading this paper and was wondering if we just want to achieve a compact version of the original representation we could just use a traditional AutoEncoder. Is there any specific reason the VAE is used? Thanks! submitted by /u/StageTraditional636 [link] [comments]
    [R] Play against an AI to detect fake audio
    (1 min) Hi everybody, i'm a PhD student interested in audio spoofs (voice recordings faked with the help of AI), and have developed an online game: You play against an artificial intelligence and try to distinguish spoofed from real audio recordings. It's fun, and very much supports my research. All partificpation (i.e. playing the game), comments or suggestions are welcome! https://deepfake-demo.aisec.fraunhofer.de/ submitted by /u/mummni [link] [comments]
    [D] What are the reviewers' score of the submissions nominated for best paper award in top ML conferences such as NeurIPS, ICML, AISTATS, etc.?
    (1 min) I submitted a paper to AISTATS 2022 that can be a breakthrough with outstanding contributions. The paper received 876 scores from reviewers that could only improve to 877 after the rebuttals. What are the chances that our submission enters the short list for best paper recognition? What are the average reviewers' score of the ones getting nominated in these top conferences? submitted by /u/Cyrus_the-great [link] [comments]
    [D]who knows the paper address of the code?
    (0 min) https://github.com/yuzisheng/trajectory-compress especially the Spatio-Temporal Curvature Streaming submitted by /u/choayue [link] [comments]
    [P] Sieve: We processed ~24 hours of security footage in <10 mins (now semantically searchable per-frame!)
    (6 min) Hey everyone! I’m one of the creators of Sieve, and I’m excited to be sharing it! Sieve is an API that helps you store, process, and automatically search your video data–instantly and efficiently. Just think 10 cameras recording footage at 30 FPS, 24/7. That would be 27 million frames generated in a single day. The videos might be searchable by timestamp, but finding moments of interest is like searching for a needle in a haystack. We built this visual demo (link here) a little while back which we’d love to get feedback on. It’s ~24 hours of security footage that our API processed in <10 mins and has simple querying and export functionality enabled. We see applications in better understanding what data you have, figuring out which data to send to labeling, sampling datasets for training, and building multiple test sets for models by scenario. To try it on your videos: https://github.com/Sieve-Data/automatic-video-processing Visual dashboard walkthrough: https://youtu.be/_uyjp_HGZl4 https://preview.redd.it/bn8hoqoa1m981.png?width=2540&format=png&auto=webp&s=25fb08037438593291fecf7e50ca58ec1f9bea72 https://preview.redd.it/jwkd7uoa1m981.png?width=2540&format=png&auto=webp&s=e25382b4b09855e5934608754a8b74bdbaf93204 https://preview.redd.it/0dd74toa1m981.png?width=2540&format=png&auto=webp&s=05b7625195947b8f15891a9019070efa3730b336 https://preview.redd.it/alg4ruoa1m981.png?width=2540&format=png&auto=webp&s=f5caad143b0d23f3add08f431d0ada322ae4e84d https://preview.redd.it/8c2pw0pa1m981.png?width=2540&format=png&auto=webp&s=e6438f03e3fc7a00ccdf01c9b7075b9e8752affd submitted by /u/happybirthday290 [link] [comments]
    [D] Paper Summary [Rethinking Segmentation from a Sequence-to-Sequence Perspective with Transfromers]
    (1 min) Hi, I have just published my latest medium article. It is a summary of a scientific paper that aims to eliminate the effect of locality which is one of the limitations of CNNs. In this attempt, researchers tried to reform the image semantic segmentation problem then operate a proposed transformer, and finally, introduce three different decoder architectures. Please read it and give me your feedback. If you find it interesting, you can share it with others who are interested in ML as well. Also, if you find it helpful, you can follow me on medium to be updated on my forthcoming articles.🙂 https://rezayazdanfar.medium.com/26868efacc52 submitted by /u/rezayazdanfar [link] [comments]
    [D] Which tools can be helpful for annotation of videos for action recognition?
    (1 min) There is a team in my university who work on ergonomics. They want to do action recognition on some videos. They approached me for help. I work on images. I don't have idea about videos. I have dataset. I want to annotate key points in each frame. Please tell me which tools can be helpful for annotation of videos? submitted by /u/SAbdusSamad [link] [comments]
  • Reinforcement Learning

    Training with multiple agents.
    (0 min) Does anyone know if any of the open source RL libraries support multi-agent training in which agents can have different actions spaces? submitted by /u/YearPersonal5709 [link] [comments]
    Difference between cooperative games and stochastic games
    (1 min) Some say that cooperative games are a subset of stochastic games. But I don't find how. In cooperative games, all the agents have a team reward and have different states whereas in stochastic games, all the agents receives individual reward and they have the same state in a particular time step. Can someone help me understand the difference between these two games? submitted by /u/j0ker_70 [link] [comments]
    Understanding "Stochastic Value Gradients (SVG)"
    (1 min) Hi, I have a hard time understanding the paper "Learning Continuous Control Policies by Stochastic Value Gradients" (https://arxiv.org/pdf/1510.09142.pdf). So, in my understanding is that the stochastic value gradient could be computed easily on imagined ("planned") data, where we rollout the policy in the learned world model. However, this seems undesirable due to compounding model errors. So far, so good. The above issue is fixed when we use real environment samples, although the noise variables η and ξ required to calculate the stochastic value gradient are then unknown. The paper then applies the Bayes rule to reformulate the stochastic value gradient to a tractable formulation. My issue: to compute the stochastic value gradient, the noise variables must be inferred. On this issue, the paper simply states "... infer the missing noise variables, possibly by sampling from p(η,ξ|s, a, s′)". How can we sample these noise variables? How can we infer the noise variables? Any help would be greatly appreciated! submitted by /u/Internal-Brush4929 [link] [comments]
    Training Data
    (1 min) Hi, i'm pretty new to this kind of learning task and i found hard to understand some base concepts. I'm studying the Decision Transformer paper and for an university problem i need to train that model using the MiniWorld environment. While i was looking at the code that is provided by the author of MiniWorld i got a question. How i get training data? The authors of Decision Transformer uses d4rl that provide some offline dataset. How can create training data in a similar way using my environment? submitted by /u/maverik75 [link] [comments]
    Anyone using RL for Trading stocks?
    (2 min) submitted by /u/GarantBM [link] [comments]
    How to make gym a parallel environment?
    (1 min) I'm run gym environment CartPole-v0, but my GPU usage is low. I get a resolution that I can use N same policy Networks to get actions for N envs. I tried agymc, but it can't satisfy my need. Because it can only copy the envs, and cannot use N Networks. neural.py (github.com) So I write a code as above, but I get this error as below, how can I repair the bug: Exception in thread Thread-7: File "C:\Users\afuler\AppData\Local\Programs\Python\Python39\lib\threading.py", line 973, in _bootstrap_inner self.run() File "C:\Users\afuler\AppData\Local\Programs\Python\Python39\lib\threading.py", line 910, in run self._target(*self._args, **self._kwargs) File "d:\workspace\rl\nonnueral\nonnueralrl.py", line 271, in predict envList[worker_num].render() File "C:\Users\afuler\AppData\Local\Pro…
    PPO convergence when moving to continuous action space
    (1 min) Hello :D, I had a PPO implementation with convolutional neural networks that worked just fine with discrete action space (environment is a grid world where an agent has 4 possible actions=directions). I took the same implementation and switched to continuous action space (2 actions = force over x and y axis). The model does pretty well in the first 7mil steps (reward converges), but then the reward suddenly goes down. Any idea what could be the reason? could it be the decaying variance I am using for the continuous action selection? submitted by /u/AhmedNizam_ [link] [comments]

2022-01-03

  • cs.LG updates on arXiv.org

    CONFAIR: Configurable and Interpretable Algorithmic Fairness. (arXiv:2111.08878v3 [cs.LG] UPDATED)
    (2 min) The rapid growth of data in the recent years has led to the development of complex learning algorithms that are often used to make decisions in real world. While the positive impact of the algorithms has been tremendous, there is a need to mitigate any bias arising from either training samples or implicit assumptions made about the data samples. This need becomes critical when algorithms are used in automated decision making systems that can hugely impact people's lives. Many approaches have been proposed to make learning algorithms fair by detecting and mitigating bias in different stages of optimization. However, due to a lack of a universal definition of fairness, these algorithms optimize for a particular interpretation of fairness which makes them limited for real world use. Moreover, an underlying assumption that is common to all algorithms is the apparent equivalence of achieving fairness and removing bias. In other words, there is no user defined criteria that can be incorporated into the optimization procedure for producing a fair algorithm. Motivated by these shortcomings of existing methods, we propose the CONFAIR procedure that produces a fair algorithm by incorporating user constraints into the optimization procedure. Furthermore, we make the process interpretable by estimating the most predictive features from data. We demonstrate the efficacy of our approach on several real world datasets using different fairness criteria.
    ML-Decoder: Scalable and Versatile Classification Head. (arXiv:2111.12933v2 [cs.CV] UPDATED)
    (2 min) In this paper, we introduce ML-Decoder, a new attention-based classification head. ML-Decoder predicts the existence of class labels via queries, and enables better utilization of spatial data compared to global average pooling. By redesigning the decoder architecture, and using a novel group-decoding scheme, ML-Decoder is highly efficient, and can scale well to thousands of classes. Compared to using a larger backbone, ML-Decoder consistently provides a better speed-accuracy trade-off. ML-Decoder is also versatile - it can be used as a drop-in replacement for various classification heads, and generalize to unseen classes when operated with word queries. Novel query augmentations further improve its generalization ability. Using ML-Decoder, we achieve state-of-the-art results on several classification tasks: on MS-COCO multi-label, we reach 91.4% mAP; on NUS-WIDE zero-shot, we reach 31.1% ZSL mAP; and on ImageNet single-label, we reach with vanilla ResNet50 backbone a new top score of 80.7%, without extra data or distillation. Public code is available at: https://github.com/Alibaba-MIIL/ML_Decoder
    Replacing Rewards with Examples: Example-Based Policy Search via Recursive Classification. (arXiv:2103.12656v2 [cs.LG] UPDATED)
    (2 min) Reinforcement learning (RL) algorithms assume that users specify tasks by manually writing down a reward function. However, this process can be laborious and demands considerable technical expertise. Can we devise RL algorithms that instead enable users to specify tasks simply by providing examples of successful outcomes? In this paper, we derive a control algorithm that maximizes the future probability of these successful outcome examples. Prior work has approached similar problems with a two-stage process, first learning a reward function and then optimizing this reward function using another RL algorithm. In contrast, our method directly learns a value function from transitions and successful outcomes, without learning this intermediate reward function. Our method therefore requires fewer hyperparameters to tune and lines of code to debug. We show that our method satisfies a new data-driven Bellman equation, where examples take the place of the typical reward function term. Experiments show that our approach outperforms prior methods that learn explicit reward functions.
    A Survey of Embodied AI: From Simulators to Research Tasks. (arXiv:2103.04918v6 [cs.AI] UPDATED)
    (2 min) There has been an emerging paradigm shift from the era of "internet AI" to "embodied AI", where AI algorithms and agents no longer learn from datasets of images, videos or text curated primarily from the internet. Instead, they learn through interactions with their environments from an egocentric perception similar to humans. Consequently, there has been substantial growth in the demand for embodied AI simulators to support various embodied AI research tasks. This growing interest in embodied AI is beneficial to the greater pursuit of Artificial General Intelligence (AGI), but there has not been a contemporary and comprehensive survey of this field. This paper aims to provide an encyclopedic survey for the field of embodied AI, from its simulators to its research. By evaluating nine current embodied AI simulators with our proposed seven features, this paper aims to understand the simulators in their provision for use in embodied AI research and their limitations. Lastly, this paper surveys the three main research tasks in embodied AI -- visual exploration, visual navigation and embodied question answering (QA), covering the state-of-the-art approaches, evaluation metrics and datasets. Finally, with the new insights revealed through surveying the field, the paper will provide suggestions for simulator-for-task selections and recommendations for the future directions of the field.
    A GAN-Like Approach for Physics-Based Imitation Learning and Interactive Character Control. (arXiv:2105.10066v4 [cs.GR] UPDATED)
    (2 min) We present a simple and intuitive approach for interactive control of physically simulated characters. Our work builds upon generative adversarial networks (GAN) and reinforcement learning, and introduces an imitation learning framework where an ensemble of classifiers and an imitation policy are trained in tandem given pre-processed reference clips. The classifiers are trained to discriminate the reference motion from the motion generated by the imitation policy, while the policy is rewarded for fooling the discriminators. Using our GAN-based approach, multiple motor control policies can be trained separately to imitate different behaviors. In runtime, our system can respond to external control signal provided by the user and interactively switch between different policies. Compared to existing methods, our proposed approach has the following attractive properties: 1) achieves state-of-the-art imitation performance without manually designing and fine tuning a reward function; 2) directly controls the character without having to track any target reference pose explicitly or implicitly through a phase state; and 3) supports interactive policy switching without requiring any motion generation or motion matching mechanism. We highlight the applicability of our approach in a range of imitation and interactive control tasks, while also demonstrating its ability to withstand external perturbations as well as to recover balance. Overall, our approach generates high-fidelity motion, has low runtime cost, and can be easily integrated into interactive applications and games.
    Analysis of Regularized Learning in Banach Spaces. (arXiv:2109.03159v4 [cs.LG] UPDATED)
    (2 min) This article presents a new way to study the theory of regularized learning for generalized data in Banach spaces including representer theorems and convergence theorems. The generalized data are composed of linear functionals and real scalars as the input and output elements to represent the discrete information of different local models. By the extension of the classical machine learning, the empirical risks are computed by the generalized data and the loss functions. According to the techniques of regularization, the exact solutions are approximated globally by minimizing the regularized empirical risks over the Banach spaces. The existence and convergence of the approximate solutions are guaranteed by the relative compactness of the generalized input data in the predual spaces of the Banach spaces.
    Leveraging in-domain supervision for unsupervised image-to-image translation tasks via multi-stream generators. (arXiv:2112.15091v1 [cs.CV])
    (2 min) Supervision for image-to-image translation (I2I) tasks is hard to come by, but bears significant effect on the resulting quality. In this paper, we observe that for many Unsupervised I2I (UI2I) scenarios, one domain is more familiar than the other, and offers in-domain prior knowledge, such as semantic segmentation. We argue that for complex scenes, figuring out the semantic structure of the domain is hard, especially with no supervision, but is an important part of a successful I2I operation. We hence introduce two techniques to incorporate this invaluable in-domain prior knowledge for the benefit of translation quality: through a novel Multi-Stream generator architecture, and through a semantic segmentation-based regularization loss term. In essence, we propose splitting the input data according to semantic masks, explicitly guiding the network to different behavior for the different regions of the image. In addition, we propose training a semantic segmentation network along with the translation task, and to leverage this output as a loss term that improves robustness. We validate our approach on urban data, demonstrating superior quality in the challenging UI2I tasks of converting day images to night ones. In addition, we also demonstrate how reinforcing the target dataset with our augmented images improves the training of downstream tasks such as the classical detection one.
    When are Iterative Gaussian Processes Reliably Accurate?. (arXiv:2112.15246v1 [cs.LG])
    (2 min) While recent work on conjugate gradient methods and Lanczos decompositions have achieved scalable Gaussian process inference with highly accurate point predictions, in several implementations these iterative methods appear to struggle with numerical instabilities in learning kernel hyperparameters, and poor test likelihoods. By investigating CG tolerance, preconditioner rank, and Lanczos decomposition rank, we provide a particularly simple prescription to correct these issues: we recommend that one should use a small CG tolerance ($\epsilon \leq 0.01$) and a large root decomposition size ($r \geq 5000$). Moreover, we show that L-BFGS-B is a compelling optimizer for Iterative GPs, achieving convergence with fewer gradient updates.
    Processing Images from Multiple IACTs in the TAIGA Experiment with Convolutional Neural Networks. (arXiv:2112.15382v1 [astro-ph.IM])
    (2 min) Extensive air showers created by high-energy particles interacting with the Earth atmosphere can be detected using imaging atmospheric Cherenkov telescopes (IACTs). The IACT images can be analyzed to distinguish between the events caused by gamma rays and by hadrons and to infer the parameters of the event such as the energy of the primary particle. We use convolutional neural networks (CNNs) to analyze Monte Carlo-simulated images from the telescopes of the TAIGA experiment. The analysis includes selection of the images corresponding to the showers caused by gamma rays and estimating the energy of the gamma rays. We compare performance of the CNNs using images from a single telescope and the CNNs using images from two telescopes as inputs.
    A Survey on Deep learning based Document Image Enhancement. (arXiv:2112.02719v3 [cs.CV] UPDATED)
    (2 min) Digitized documents such as scientific articles, tax forms, invoices, contract papers, historic texts are widely used nowadays. These document images could be degraded or damaged due to various reasons including poor lighting conditions, shadow, distortions like noise and blur, aging, ink stain, bleed-through, watermark, stamp, etc. Document image enhancement plays a crucial role as a pre-processing step in many automated document analysis and recognition tasks such as character recognition. With recent advances in deep learning, many methods are proposed to enhance the quality of these document images. In this paper, we review deep learning-based methods, datasets, and metrics for six main document image enhancement tasks, including binarization, debluring, denoising, defading, watermark removal, and shadow removal. We summarize the recent works for each task and discuss their features, challenges, and limitations. We introduce multiple document image enhancement tasks that have received little to no attention, including over and under exposure correction, super resolution, and bleed-through removal. We identify several promising research directions and opportunities for future research.
    A toolkit for data-driven discovery of governing equations in high-noise regimes. (arXiv:2111.04870v2 [cs.LG] UPDATED)
    (2 min) We consider the data-driven discovery of governing equations from time-series data in the limit of high noise. The algorithms developed describe an extensive toolkit of methods for circumventing the deleterious effects of noise in the context of the sparse identification of nonlinear dynamics (SINDy) framework. We offer two primary contributions, both focused on noisy data acquired from a system x' = f(x). First, we propose, for use in high-noise settings, an extensive toolkit of critically enabling extensions for the SINDy regression method, to progressively cull functionals from an over-complete library and yield a set of sparse equations that regress to the derivate x'. These innovations can extract sparse governing equations and coefficients from high-noise time-series data (e.g. 300% added noise). For example, it discovers the correct sparse libraries in the Lorenz system, with median coefficient estimate errors equal to 1% - 3% (for 50% noise), 6% - 8% (for 100% noise); and 23% - 25% (for 300% noise). The enabling modules in the toolkit are combined into a single method, but the individual modules can be tactically applied in other equation discovery methods (SINDy or not) to improve results on high-noise data. Second, we propose a technique, applicable to any model discovery method based on x' = f(x), to assess the accuracy of a discovered model in the context of non-unique solutions due to noisy data. Currently, this non-uniqueness can obscure a discovered model's accuracy and thus a discovery method's effectiveness. We describe a technique that uses linear dependencies among functionals to transform a discovered model into an equivalent form that is closest to the true model, enabling more accurate assessment of a discovered model's accuracy.
    Inferring perceptual decision making parameters from behavior in production and reproduction tasks. (arXiv:2112.15521v1 [cs.LG])
    (2 min) Bayesian models of behavior have provided computational level explanations in a range of psychophysical tasks. One fundamental experimental paradigm is the production or reproduction task, in which subjects are instructed to generate an action that either reproduces a previously sensed stimulus magnitude or achieves a target response. This type of task therefore distinguishes itself from other psychophysical tasks in that the responses are on a continuum and effort plays an important role with increasing response magnitude. Based on Bayesian decision theory we present an inference method to recover perceptual uncertainty, response variability, and the cost function underlying human responses. Crucially, the cost function is parameterized such that effort is explicitly included. We present a hybrid inference method employing MCMC sampling utilizing appropriate proposal distributions and an inner loop utilizing amortized inference with a neural network that approximates the mode of the optimal response distribution. We show how this model can be utilized to avoid unidentifiability of experimental designs and that parameters can be recovered through validation on synthetic and application to experimental data. Our approach will enable behavioral scientists to perform Bayesian inference of decision making parameters in production and reproduction tasks.
    Embedding Information onto a Dynamical System. (arXiv:2105.10766v3 [math.DS] UPDATED)
    (2 min) The celebrated Takens' embedding theorem concerns embedding an attractor of a dynamical system in a Euclidean space of appropriate dimension through a generic delay-observation map. The embedding also establishes a topological conjugacy. In this paper, we show how an arbitrary sequence can be mapped into another space as an attractive solution of a nonautonomous dynamical system. Such mapping also entails a topological conjugacy and an embedding between the sequence and the attractive solution spaces. This result is not a generalization of Takens embedding theorem but helps us understand what exactly is required by discrete-time state space models widely used in applications to embed an external stimulus onto its solution space. Our results settle another basic problem concerning the perturbation of an autonomous dynamical system. We describe what exactly happens to the dynamics when exogenous noise perturbs continuously a local irreducible attracting set (such as a stable fixed point) of a discrete-time autonomous dynamical system.
    Notes on the H-measure of classifier performance. (arXiv:2106.11888v2 [cs.LG] UPDATED)
    (2 min) The H-measure is a classifier performance measure which takes into account the context of application without requiring a rigid value of relative misclassification costs to be set. Since its introduction in 2009 it has become widely adopted. This paper answers various queries which users have raised since its introduction, including questions about its interpretation, the choice of a weighting function, whether it is strictly proper, and its coherence, and relates the measure to other work.
    Frame invariance and scalability of neural operators for partial differential equations. (arXiv:2112.14769v1 [cs.LG])
    (2 min) Partial differential equations (PDEs) play a dominant role in the mathematical modeling of many complex dynamical processes. Solving these PDEs often requires prohibitively high computational costs, especially when multiple evaluations must be made for different parameters or conditions. After training, neural operators can provide PDEs solutions significantly faster than traditional PDE solvers. In this work, invariance properties and computational complexity of two neural operators are examined for transport PDE of a scalar quantity. Neural operator based on graph kernel network (GKN) operates on graph-structured data to incorporate nonlocal dependencies. Here we propose a modified formulation of GKN to achieve frame invariance. Vector cloud neural network (VCNN) is an alternate neural operator with embedded frame invariance which operates on point cloud data. GKN-based neural operator demonstrates slightly better predictive performance compared to VCNN. However, GKN requires an excessively high computational cost that increases quadratically with the increasing number of discretized objects as compared to a linear increase for VCNN.
    Semi-Decentralized Federated Edge Learning for Fast Convergence on Non-IID Data. (arXiv:2104.12678v5 [cs.NI] UPDATED)
    (0 min) Federated edge learning (FEEL) has emerged as an effective approach to reduce the large communication latency in Cloud-based machine learning solutions, while preserving data privacy. Unfortunately, the learning performance of FEEL may be compromised due to limited training data in a single edge cluster. In this paper, we investigate a novel framework of FEEL, namely semi-decentralized federated edge learning (SD-FEEL). By allowing model aggregation across different edge clusters, SD-FEEL enjoys the benefit of FEEL in reducing the training latency, while improving the learning performance by accessing richer training data from multiple edge clusters. A training algorithm for SD-FEEL with three main procedures in each round is presented, including local model updates, intra-cluster and inter-cluster model aggregations, which is proved to converge on non-independent and identically distributed (non-IID) data. We also characterize the interplay between the network topology of the edge servers and the communication overhead of inter-cluster model aggregation on the training performance. Experiment results corroborate our analysis and demonstrate the effectiveness of SD-FFEL in achieving faster convergence than traditional federated learning architectures. Besides, guidelines on choosing critical hyper-parameters of the training algorithm are also provided.
    A Unified Analysis Method for Online Optimization in Normed Vector Space. (arXiv:2112.12134v2 [cs.LG] UPDATED)
    (2 min) We present a unified analysis method that relies on the generalized cosine rule and $\phi$-convex for online optimization in normed vector space using dynamic regret as the performance metric. In combing the update rules, we start with strategy $S$ (a two-parameter variant strategy covering Optimistic-FTRL with surrogate linearized losses), and obtain $S$-I (type-I relaxation variant form of $S$) and $S$-II (type-II relaxation variant form of $S$, which is Optimistic-MD) by relaxation. Regret bounds for $S$-I and $S$-II are the tightest possible. As instantiations, regret bounds of normalized exponentiated subgradient and greedy/lazy projection are better than the currently known optimal results. By replacing losses of online game with monotone operators, and extending the definition of regret, namely regret$^n$, we extend online convex optimization to online monotone optimization, which expands the application scope of $S$-I and $S$-II.
    Sufficient Statistic Memory AMP. (arXiv:2112.15327v1 [cs.IT])
    (2 min) Approximate message passing (AMP) is a promising technique for unknown signal reconstruction of certain high-dimensional linear systems with non-Gaussian signaling. A distinguished feature of the AMP-type algorithms is that their dynamics can be rigorously described by state evolution. However, state evolution does not necessarily guarantee the convergence of iterative algorithms. To solve the convergence problem of AMP-type algorithms in principle, this paper proposes a memory AMP (MAMP) under a sufficient statistic condition, named sufficient statistic MAMP (SS-MAMP). We show that the covariance matrices of SS-MAMP are L-banded and convergent. Given an arbitrary MAMP, we can construct an SS-MAMP by damping, which not only ensures the convergence of MAMP but also preserves the orthogonality of MAMP, i.e., its dynamics can be rigorously described by state evolution. As a byproduct, we prove that the Bayes-optimal orthogonal/vector AMP (BO-OAMP/VAMP) is an SS-MAMP. As a result, we reveal two interesting properties of BO-OAMP/VAMP for large systems: 1) the covariance matrices are L-banded and are convergent in BO-OAMP/VAMP, and 2) damping and memory are useless (i.e., do not bring performance improvement) in BO-OAMP/VAMP. As an example, we construct a sufficient statistic Bayes-optimal MAMP (BO-MAMP), which is Bayes optimal if its state evolution has a unique fixed point and its MSE is not worse than the original BO-MAMP. Finally, simulations are provided to verify the validity and accuracy of the theoretical results.
    Entropy Regularized Optimal Transport Independence Criterion. (arXiv:2112.15265v1 [stat.ML])
    (0 min) Optimal transport (OT) and its entropy regularized offspring have recently gained a lot of attention in both machine learning and AI domains. In particular, optimal transport has been used to develop probability metrics between probability distributions. We introduce in this paper an independence criterion based on entropy regularized optimal transport. Our criterion can be used to test for independence between two samples. We establish non-asymptotic bounds for our test statistic, and study its statistical behavior under both the null and alternative hypothesis. Our theoretical results involve tools from U-process theory and optimal transport theory. We present experimental results on existing benchmarks, illustrating the interest of the proposed criterion.
    Non Asymptotic Bounds for Optimization via Online Multiplicative Stochastic Gradient Descent. (arXiv:2112.07110v4 [stat.ML] UPDATED)
    (2 min) The gradient noise of Stochastic Gradient Descent (SGD) is considered to play a key role in its properties (e.g. escaping low potential points and regularization). Past research has indicated that the covariance of the SGD error done via minibatching plays a critical role in determining its regularization and escape from low potential points. It is however not much explored how much the distribution of the error influences the behavior of the algorithm. Motivated by some new research in this area, we prove universality results by showing that noise classes that have the same mean and covariance structure of SGD via minibatching have similar properties. We mainly consider the Multiplicative Stochastic Gradient Descent (M-SGD) algorithm as introduced by Wu et al., which has a much more general noise class than the SGD algorithm done via minibatching. We establish nonasymptotic bounds for the M-SGD algorithm mainly with respect to the Stochastic Differential Equation corresponding to SGD via minibatching. We also show that the M-SGD error is approximately a scaled Gaussian distribution with mean $0$ at any fixed point of the M-SGD algorithm. We also establish bounds for the convergence of the M-SGD algorithm in the strongly convex regime.
    Financial Vision Based Differential Privacy Applications. (arXiv:2112.14075v2 [cs.LG] UPDATED)
    (2 min) The importance of deep learning data privacy has gained significant attention in recent years. It is probably to suffer data breaches when applying deep learning to cryptocurrency that lacks supervision of financial regulatory agencies. However, there is little relative research in the financial area to our best knowledge. We apply two representative deep learning privacy-privacy frameworks proposed by Google to financial trading data. We designed the experiments with several different parameters suggested from the original studies. In addition, we refer the degree of privacy to Google and Apple companies to estimate the results more reasonably. The results show that DP-SGD performs better than the PATE framework in financial trading data. The tradeoff between privacy and accuracy is low in DP-SGD. The degree of privacy also is in line with the actual case. Therefore, we can obtain a strong privacy guarantee with precision to avoid potential financial loss.
    Score-Based Generative Modeling with Critically-Damped Langevin Diffusion. (arXiv:2112.07068v2 [stat.ML] UPDATED)
    (2 min) Score-based generative models (SGMs) have demonstrated remarkable synthesis quality. SGMs rely on a diffusion process that gradually perturbs the data towards a tractable distribution, while the generative model learns to denoise. The complexity of this denoising task is, apart from the data distribution itself, uniquely determined by the diffusion process. We argue that current SGMs employ overly simplistic diffusions, leading to unnecessarily complex denoising processes, which limit generative modeling performance. Based on connections to statistical mechanics, we propose a novel critically-damped Langevin diffusion (CLD) and show that CLD-based SGMs achieve superior performance. CLD can be interpreted as running a joint diffusion in an extended space, where the auxiliary variables can be considered "velocities" that are coupled to the data variables as in Hamiltonian dynamics. We derive a novel score matching objective for CLD and show that the model only needs to learn the score function of the conditional distribution of the velocity given data, an easier task than learning scores of the data directly. We also derive a new sampling scheme for efficient synthesis from CLD-based diffusion models. We find that CLD outperforms previous SGMs in synthesis quality for similar network architectures and sampling compute budgets. We show that our novel sampler for CLD significantly outperforms solvers such as Euler--Maruyama. Our framework provides new insights into score-based denoising diffusion models and can be readily used for high-resolution image synthesis. Project page and code: https://nv-tlabs.github.io/CLD-SGM.
    Improved Algorithm for the Network Alignment Problem with Application to Binary Diffing. (arXiv:2112.15336v1 [cs.LG])
    (0 min) In this paper, we present a novel algorithm to address the Network Alignment problem. It is inspired from a previous message passing framework of Bayati et al. [2] and includes several modifications designed to significantly speed up the message updates as well as to enforce their convergence. Experiments show that our proposed model outperforms other state-of-the-art solvers. Finally, we propose an application of our method in order to address the Binary Diffing problem. We show that our solution provides better assignment than the reference differs in almost all submitted instances and outline the importance of leveraging the graphical structure of binary programs.
    Importance of Empirical Sample Complexity Analysis for Offline Reinforcement Learning. (arXiv:2112.15578v1 [cs.LG])
    (2 min) We hypothesize that empirically studying the sample complexity of offline reinforcement learning (RL) is crucial for the practical applications of RL in the real world. Several recent works have demonstrated the ability to learn policies directly from offline data. In this work, we ask the question of the dependency on the number of samples for learning from offline data. Our objective is to emphasize that studying sample complexity for offline RL is important, and is an indicator of the usefulness of existing offline algorithms. We propose an evaluation approach for sample complexity analysis of offline RL.
    QueryNet: Attack by Multi-Identity Surrogates. (arXiv:2105.15010v3 [cs.LG] UPDATED)
    (2 min) Deep Neural Networks (DNNs) are acknowledged as vulnerable to adversarial attacks, while the existing black-box attacks require extensive queries on the victim DNN to achieve high success rates. For query-efficiency, surrogate models of the victim are used to generate transferable Adversarial Examples (AEs) because of their Gradient Similarity (GS), i.e., surrogates' attack gradients are similar to the victim's ones. However, it is generally neglected to exploit their similarity on outputs, namely the Prediction Similarity (PS), to filter out inefficient queries by surrogates without querying the victim. To jointly utilize and also optimize surrogates' GS and PS, we develop QueryNet, a unified attack framework that can significantly reduce queries. QueryNet creatively attacks by multi-identity surrogates, i.e., crafts several AEs for one sample by different surrogates, and also uses surrogates to decide on the most promising AE for the query. After that, the victim's query feedback is accumulated to optimize not only surrogates' parameters but also their architectures, enhancing both the GS and the PS. Although QueryNet has no access to pre-trained surrogates' prior, it reduces queries by averagely about an order of magnitude compared to alternatives within an acceptable time, according to our comprehensive experiments: 11 victims (including two commercial models) on MNIST/CIFAR10/ImageNet, allowing only 8-bit image queries, and no access to the victim's training data. The code is available at https://github.com/AllenChen1998/QueryNet.
    Geometric Deep Learning on Molecular Representations. (arXiv:2107.12375v4 [physics.chem-ph] UPDATED)
    (2 min) Geometric deep learning (GDL), which is based on neural network architectures that incorporate and process symmetry information, has emerged as a recent paradigm in artificial intelligence. GDL bears particular promise in molecular modeling applications, in which various molecular representations with different symmetry properties and levels of abstraction exist. This review provides a structured and harmonized overview of molecular GDL, highlighting its applications in drug discovery, chemical synthesis prediction, and quantum chemistry. Emphasis is placed on the relevance of the learned molecular features and their complementarity to well-established molecular descriptors. This review provides an overview of current challenges and opportunities, and presents a forecast of the future of GDL for molecular sciences.
    Beta-VAE Reproducibility: Challenges and Extensions. (arXiv:2112.14278v2 [cs.LG] UPDATED)
    (2 min) $\beta$-VAE is a follow-up technique to variational autoencoders that proposes special weighting of the KL divergence term in the VAE loss to obtain disentangled representations. Unsupervised learning is known to be brittle even on toy datasets and a meaningful, mathematically precise definition of disentanglement remains difficult to find. Here we investigate the original $\beta$-VAE paper and add evidence to the results previously obtained indicating its lack of reproducibility. We also further expand the experimentation of the models and include further more complex datasets in the analysis. We also implement an FID scoring metric for the $\beta$-VAE model and conclude a qualitative analysis of the results obtained. We end with a brief discussion on possible future investigations that can be conducted to add more robustness to the claims.
    State Selection Algorithms and Their Impact on The Performance of Stateful Network Protocol Fuzzing. (arXiv:2112.15498v1 [cs.SE])
    (2 min) The statefulness property of network protocol implementations poses a unique challenge for testing and verification techniques, including Fuzzing. Stateful fuzzers tackle this challenge by leveraging state models to partition the state space and assist the test generation process. Since not all states are equally important and fuzzing campaigns have time limits, fuzzers need effective state selection algorithms to prioritize progressive states over others. Several state selection algorithms have been proposed but they were implemented and evaluated separately on different platforms, making it hard to achieve conclusive findings. In this work, we evaluate an extensive set of state selection algorithms on the same fuzzing platform that is AFLNet, a state-of-the-art fuzzer for network servers. The algorithm set includes existing ones supported by AFLNet and our novel and principled algorithm called AFLNetLegion. The experimental results on the ProFuzzBench benchmark show that (i) the existing state selection algorithms of AFLNet achieve very similar code coverage, (ii) AFLNetLegion clearly outperforms these algorithms in selected case studies, but (iii) the overall improvement appears insignificant. These are unexpected yet interesting findings. We identify problems and share insights that could open opportunities for future research on this topic.
    Transfer learning of phase transitions in percolation and directed percolation. (arXiv:2112.15516v1 [cond-mat.stat-mech])
    (2 min) The latest advances of statistical physics have shown remarkable performance of machine learning in identifying phase transitions. In this paper, we apply domain adversarial neural network (DANN) based on transfer learning to studying non-equilibrium and equilibrium phase transition models, which are percolation model and directed percolation (DP) model, respectively. With the DANN, only a small fraction of input configurations (2d images) needs to be labeled, which is automatically chosen, in order to capture the critical point. To learn the DP model, the method is refined by an iterative procedure in determining the critical point, which is a prerequisite for the data collapse in calculating the critical exponent $\nu_{\perp}$. We then apply the DANN to a two-dimensional site percolation with configurations filtered to include only the largest cluster which may contain the information related to the order parameter. The DANN learning of both models yields reliable results which are comparable to the ones from Monte Carlo simulations. Our study also shows that the DANN can achieve quite high accuracy at much lower cost, compared to the supervised learning.
    Revisiting Experience Replay: Continual Learning by Adaptively Tuning Task-wise Relationship. (arXiv:2112.15402v1 [cs.LG])
    (2 min) Continual learning requires models to learn new tasks while maintaining previously learned knowledge. Various algorithms have been proposed to address this real challenge. Till now, rehearsal-based methods, such as experience replay, have achieved state-of-the-art performance. These approaches save a small part of the data of the past tasks as a memory buffer to prevent models from forgetting previously learned knowledge. However, most of them treat every new task equally, i.e., fixed the hyperparameters of the framework while learning different new tasks. Such a setting lacks the consideration of the relationship/similarity between past and new tasks. For example, the previous knowledge/features learned from dogs are more beneficial for the identification of cats (new task), compared to those learned from buses. In this regard, we propose a meta learning algorithm based on bi-level optimization to adaptively tune the relationship between the knowledge extracted from the past and new tasks. Therefore, the model can find an appropriate direction of gradient during continual learning and avoid the serious overfitting problem on memory buffer. Extensive experiments are conducted on three publicly available datasets (i.e., CIFAR-10, CIFAR-100, and Tiny ImageNet). The experimental results demonstrate that the proposed method can consistently improve the performance of all baselines.
    Machine Learning Trivializing Maps: A First Step Towards Understanding How Flow-Based Samplers Scale Up. (arXiv:2112.15532v1 [hep-lat])
    (0 min) A trivializing map is a field transformation whose Jacobian determinant exactly cancels the interaction terms in the action, providing a representation of the theory in terms of a deterministic transformation of a distribution from which sampling is trivial. Recently, a proof-of-principle study by Albergo, Kanwar and Shanahan [arXiv:1904.12072] demonstrated that approximations of trivializing maps can be `machine-learned' by a class of invertible, differentiable neural models called \textit{normalizing flows}. By ensuring that the Jacobian determinant can be computed efficiently, asymptotically exact sampling from the theory of interest can be performed by drawing samples from a simple distribution and passing them through the network. From a theoretical perspective, this approach has the potential to become more efficient than traditional Markov Chain Monte Carlo sampling techniques, where autocorrelations severely diminish the sampling efficiency as one approaches the continuum limit. A major caveat is that it is not yet understood how the size of models and the cost of training them is expected to scale. As a first step, we have conducted an exploratory scaling study using two-dimensional $\phi^4$ with up to $20^2$ lattice sites. Although the scope of our study is limited to a particular model architecture and training algorithm, initial results paint an interesting picture in which training costs grow very quickly indeed. We describe a candidate explanation for the poor scaling, and outline our intentions to clarify the situation in future work.
    Disjoint Contrastive Regression Learning for Multi-Sourced Annotations. (arXiv:2112.15411v1 [cs.LG])
    (2 min) Large-scale datasets are important for the development of deep learning models. Such datasets usually require a heavy workload of annotations, which are extremely time-consuming and expensive. To accelerate the annotation procedure, multiple annotators may be employed to label different subsets of the data. However, the inconsistency and bias among different annotators are harmful to the model training, especially for qualitative and subjective tasks.To address this challenge, in this paper, we propose a novel contrastive regression framework to address the disjoint annotations problem, where each sample is labeled by only one annotator and multiple annotators work on disjoint subsets of the data. To take account of both the intra-annotator consistency and inter-annotator inconsistency, two strategies are employed.Firstly, a contrastive-based loss is applied to learn the relative ranking among different samples of the same annotator, with the assumption that the ranking of samples from the same annotator is unanimous. Secondly, we apply the gradient reversal layer to learn robust representations that are invariant to different annotators. Experiments on the facial expression prediction task, as well as the image quality assessment task, verify the effectiveness of our proposed framework.
    DeePN$^2$: A deep learning-based non-Newtonian hydrodynamic model. (arXiv:2112.14798v1 [physics.comp-ph])
    (2 min) A long standing problem in the modeling of non-Newtonian hydrodynamics is the availability of reliable and interpretable hydrodynamic models that faithfully encode the underlying micro-scale polymer dynamics. The main complication arises from the long polymer relaxation time, the complex molecular structure, and heterogeneous interaction. DeePN$^2$, a deep learning-based non-Newtonian hydrodynamic model, has been proposed and has shown some success in systematically passing the micro-scale structural mechanics information to the macro-scale hydrodynamics for suspensions with simple polymer conformation and bond potential. The model retains a multi-scaled nature by mapping the polymer configurations into a set of symmetry-preserving macro-scale features. The extended constitutive laws for these macro-scale features can be directly learned from the kinetics of their micro-scale counterparts. In this paper, we carry out further study of DeePN$^2$ using more complex micro-structural models. We show that DeePN$^2$ can faithfully capture the broadly overlooked viscoelastic differences arising from the specific molecular structural mechanics without human intervention.
    Measuring and Sampling: A Metric-guided Subgraph Learning Framework for Graph Neural Network. (arXiv:2112.15015v1 [cs.LG])
    (2 min) Graph neural network (GNN) has shown convincing performance in learning powerful node representations that preserve both node attributes and graph structural information. However, many GNNs encounter problems in effectiveness and efficiency when they are designed with a deeper network structure or handle large-sized graphs. Several sampling algorithms have been proposed for improving and accelerating the training of GNNs, yet they ignore understanding the source of GNN performance gain. The measurement of information within graph data can help the sampling algorithms to keep high-value information while removing redundant information and even noise. In this paper, we propose a Metric-Guided (MeGuide) subgraph learning framework for GNNs. MeGuide employs two novel metrics: Feature Smoothness and Connection Failure Distance to guide the subgraph sampling and mini-batch based training. Feature Smoothness is designed for analyzing the feature of nodes in order to retain the most valuable information, while Connection Failure Distance can measure the structural information to control the size of subgraphs. We demonstrate the effectiveness and efficiency of MeGuide in training various GNNs on multiple datasets.
    A Unified and Constructive Framework for the Universality of Neural Networks. (arXiv:2112.14877v1 [cs.LG])
    (2 min) One of the reasons that many neural networks are capable of replicating complicated tasks or functions is their universality property. The past few decades have seen many attempts in providing constructive proofs for single or class of neural networks. This paper is an effort to provide a unified and constructive framework for the universality of a large class of activations including most of existing activations and beyond. At the heart of the framework is the concept of neural network approximate identity. It turns out that most of existing activations are neural network approximate identity, and thus universal in the space of continuous of functions on compacta. The framework induces several advantages. First, it is constructive with elementary means from functional analysis, probability theory, and numerical analysis. Second, it is the first unified attempt that is valid for most of existing activations. Third, as a by product, the framework provides the first university proof for some of the existing activation functions including Mish, SiLU, ELU, GELU, and etc. Fourth, it discovers new activations with guaranteed universality property. Indeed, any activation\textemdash whose $\k$th derivative, with $\k$ being an integer, is integrable and essentially bounded\textemdash is universal. Fifth, for a given activation and error tolerance, the framework provides precisely the architecture of the corresponding one-hidden neural network with predetermined number of neuron, and the values of weights/biases.
    Survey Descent: A Multipoint Generalization of Gradient Descent for Nonsmooth Optimization. (arXiv:2111.15645v2 [math.OC] UPDATED)
    (2 min) For strongly convex objectives that are smooth, the classical theory of gradient descent ensures linear convergence relative to the number of gradient evaluations. An analogous nonsmooth theory is challenging: even when the objective is smooth at every iterate, the corresponding local models are unstable, and traditional remedies need unpredictably many cutting planes. We instead propose a multipoint generalization of the gradient descent iteration for local optimization. While designed with general objectives in mind, we are motivated by a "max-of-smooth" model that captures subdifferential dimension at optimality. We prove linear convergence when the objective is itself max-of-smooth, and experiments suggest a more general phenomenon.
    A Deep Reinforcement Learning Approach for Solving the Traveling Salesman Problem with Drone. (arXiv:2112.12545v2 [math.OC] UPDATED)
    (0 min) Reinforcement learning has recently shown promise in learning quality solutions in many combinatorial optimization problems. In particular, the attention-based encoder-decoder models show high effectiveness on various routing problems, including the Traveling Salesman Problem (TSP). Unfortunately, they perform poorly for the TSP with Drone (TSP-D), requiring routing a heterogeneous fleet of vehicles in coordination -- a truck and a drone. In TSP-D, the two vehicles are moving in tandem and may need to wait at a node for the other vehicle to join. State-less attention-based decoder fails to make such coordination between vehicles. We propose an attention encoder-LSTM decoder hybrid model, in which the decoder's hidden state can represent the sequence of actions made. We empirically demonstrate that such a hybrid model improves upon a purely attention-based model for both solution quality and computational efficiency. Our experiments on the min-max Capacitated Vehicle Routing Problem (mmCVRP) also confirm that the hybrid model is more suitable for coordinated routing of multiple vehicles than the attention-based model.
    Random Graph-Based Neuromorphic Learning with a Layer-Weaken Structure. (arXiv:2111.08888v2 [cs.LG] UPDATED)
    (0 min) Unified understanding of neuro networks (NNs) gets the users into great trouble because they have been puzzled by what kind of rules should be obeyed to optimize the internal structure of NNs. Considering the potential capability of random graphs to alter how computation is performed, we demonstrate that they can serve as architecture generators to optimize the internal structure of NNs. To transform the random graph theory into an NN model with practical meaning and based on clarifying the input-output relationship of each neuron, we complete data feature mapping by calculating Fourier Random Features (FRFs). Under the usage of this low-operation cost approach, neurons are assigned to several groups of which connection relationships can be regarded as uniform representations of random graphs they belong to, and random arrangement fuses those neurons to establish the pattern matrix, markedly reducing manual participation and computational cost without the fixed and deep architecture. Leveraging this single neuromorphic learning model termed random graph-based neuro network (RGNN) we develop a joint classification mechanism involving information interaction between multiple RGNNs and realize significant performance improvements in supervised learning for three benchmark tasks, whereby they effectively avoid the adverse impact of the interpretability of NNs on the structure design and engineering practice.
    Benign Overfitting in Adversarially Robust Linear Classification. (arXiv:2112.15250v1 [cs.LG])
    (0 min) "Benign overfitting", where classifiers memorize noisy training data yet still achieve a good generalization performance, has drawn great attention in the machine learning community. To explain this surprising phenomenon, a series of works have provided theoretical justification in over-parameterized linear regression, classification, and kernel methods. However, it is not clear if benign overfitting still occurs in the presence of adversarial examples, i.e., examples with tiny and intentional perturbations to fool the classifiers. In this paper, we show that benign overfitting indeed occurs in adversarial training, a principled approach to defend against adversarial examples. In detail, we prove the risk bounds of the adversarially trained linear classifier on the mixture of sub-Gaussian data under $\ell_p$ adversarial perturbations. Our result suggests that under moderate perturbations, adversarially trained linear classifiers can achieve the near-optimal standard and adversarial risks, despite overfitting the noisy training data. Numerical experiments validate our theoretical findings.
    ifMixup: Towards Intrusion-Free Graph Mixup for Graph Classification. (arXiv:2110.09344v2 [cs.LG] UPDATED)
    (0 min) We present a simple and yet effective interpolation-based regularization technique to improve the generalization of Graph Neural Networks (GNNs). Our method leverages the recent advances in Mixup regularizer for vision and text, where random sample pairs and their labels are interpolated to create synthetic samples for training. Unlike images or natural sentences, graphs have arbitrary structure and topology, and even simply deleting or adding one edge from a graph can dramatically change its semantic meanings. This makes interpolating graph inputs very challenging because mixing graph pairs may naturally create graphs with identical structure but with conflict labels, causing the manifold intrusion issue. To cope with this obstacle, we propose a simple input mixing schema for Mixup on graph, coined ifMixup. We theoretically prove that, with a mild assumption, ifMixup guarantees that the mixed graphs are manifold intrusion free. We also empirically verify that ifMixup can effectively regularize the graph classification learning, resulting in superior predictive accuracy over popular graph augmentation baselines.
    Accelerated Primal-Dual Gradient Method for Smooth and Convex-Concave Saddle-Point Problems with Bilinear Coupling. (arXiv:2112.15199v1 [math.OC])
    (0 min) In this paper we study a convex-concave saddle-point problem $\min_x\max_y f(x) + y^\top\mathbf{A} x - g(y)$, where $f(x)$ and $g(y)$ are smooth and convex functions. We propose an Accelerated Primal-Dual Gradient Method for solving this problem which (i) achieves an optimal linear convergence rate in the strongly-convex-strongly-concave regime matching the lower complexity bound (Zhang et al., 2021) and (ii) achieves an accelerated linear convergence rate in the case when only one of the functions $f(x)$ and $g(y)$ is strongly convex or even none of them are. Finally, we obtain a linearly-convergent algorithm for the general smooth and convex-concave saddle point problem $\min_x\max_y F(x,y)$ without requirement of strong convexity or strong concavity.
    A Neural Network Solves and Generates Mathematics Problems by Program Synthesis: Calculus, Differential Equations, Linear Algebra, and More. (arXiv:2112.15594v1 [cs.LG])
    (0 min) We demonstrate that a neural network pre-trained on text and fine-tuned on code solves Mathematics problems by program synthesis. We turn questions into programming tasks, automatically generate programs, and then execute them, perfectly solving university-level problems from MIT's large Mathematics courses (Single Variable Calculus 18.01, Multivariable Calculus 18.02, Differential Equations 18.03, Introduction to Probability and Statistics 18.05, Linear Algebra 18.06, and Mathematics for Computer Science 6.042) as well as questions from a MATH dataset (on Prealgebra, Algebra, Counting and Probability, Number Theory, and Precalculus), the latest benchmark of advanced mathematics problems specifically designed to assess mathematical reasoning. We explore prompt generation methods that enable Transformers to generate question solving programs for these subjects, including solutions with plots. We generate correct answers for a random sample of questions in each topic. We quantify the gap between the original and transformed questions and perform a survey to evaluate the quality and difficulty of generated questions. This is the first work to automatically solve, grade, and generate university-level Mathematics course questions at scale which represents a milestone for higher education.
    Unintended Selection: Persistent Qualification Rate Disparities and Interventions. (arXiv:2111.01201v2 [cs.LG] UPDATED)
    (0 min) Realistically -- and equitably -- modeling the dynamics of group-level disparities in machine learning remains an open problem. In particular, we desire models that do not suppose inherent differences between artificial groups of people -- but rather endogenize disparities by appeal to unequal initial conditions of insular subpopulations. In this paper, agents each have a real-valued feature $X$ (e.g., credit score) informed by a "true" binary label $Y$ representing qualification (e.g., for a loan). Each agent alternately (1) receives a binary classification label $\hat{Y}$ (e.g., loan approval) from a Bayes-optimal machine learning classifier observing $X$ and (2) may update their qualification $Y$ by imitating successful strategies (e.g., seek a raise) within an isolated group $G$ of agents to which they belong. We consider the disparity of qualification rates $\Pr(Y=1)$ between different groups and how this disparity changes subject to a sequence of Bayes-optimal classifiers repeatedly retrained on the global population. We model the evolving qualification rates of each subpopulation (group) using the replicator equation, which derives from a class of imitation processes. We show that differences in qualification rates between subpopulations can persist indefinitely for a set of non-trivial equilibrium states due to uniformed classifier deployments, even when groups are identical in all aspects except initial qualification densities. We next simulate the effects of commonly proposed fairness interventions on this dynamical system along with a new feedback control mechanism capable of permanently eliminating group-level qualification rate disparities. We conclude by discussing the limitations of our model and findings and by outlining potential future work.
    on the effectiveness of generative adversarial network on anomaly detection. (arXiv:2112.15541v1 [cs.LG])
    (0 min) Identifying anomalies refers to detecting samples that do not resemble the training data distribution. Many generative models have been used to find anomalies, and among them, generative adversarial network (GAN)-based approaches are currently very popular. GANs mainly rely on the rich contextual information of these models to identify the actual training distribution. Following this analogy, we suggested a new unsupervised model based on GANs --a combination of an autoencoder and a GAN. Further, a new scoring function was introduced to target anomalies where a linear combination of the internal representation of the discriminator and the generator's visual representation, plus the encoded representation of the autoencoder, come together to define the proposed anomaly score. The model was further evaluated on benchmark datasets such as SVHN, CIFAR10, and MNIST, as well as a public medical dataset of leukemia images. In all the experiments, our model outperformed its existing counterparts while slightly improving the inference time.
    Reward-Free Model-Based Reinforcement Learning with Linear Function Approximation. (arXiv:2110.06394v2 [cs.LG] UPDATED)
    (0 min) We study the model-based reward-free reinforcement learning with linear function approximation for episodic Markov decision processes (MDPs). In this setting, the agent works in two phases. In the exploration phase, the agent interacts with the environment and collects samples without the reward. In the planning phase, the agent is given a specific reward function and uses samples collected from the exploration phase to learn a good policy. We propose a new provably efficient algorithm, called UCRL-RFE under the Linear Mixture MDP assumption, where the transition probability kernel of the MDP can be parameterized by a linear function over certain feature mappings defined on the triplet of state, action, and next state. We show that to obtain an $\epsilon$-optimal policy for arbitrary reward function, UCRL-RFE needs to sample at most $\tilde{\mathcal{O}}(H^5d^2\epsilon^{-2})$ episodes during the exploration phase. Here, $H$ is the length of the episode, $d$ is the dimension of the feature mapping. We also propose a variant of UCRL-RFE using Bernstein-type bonus and show that it needs to sample at most $\tilde{\mathcal{O}}(H^4d(H + d)\epsilon^{-2})$ to achieve an $\epsilon$-optimal policy. By constructing a special class of linear Mixture MDPs, we also prove that for any reward-free algorithm, it needs to sample at least $\tilde \Omega(H^2d\epsilon^{-2})$ episodes to obtain an $\epsilon$-optimal policy. Our upper bound matches the lower bound in terms of the dependence on $\epsilon$ and the dependence on $d$ if $H \ge d$.
    Evaluation of Interpretability for Deep Learning algorithms in EEG Emotion Recognition: A case study in Autism. (arXiv:2111.13208v2 [eess.SP] UPDATED)
    (0 min) Current models on Explainable Artificial Intelligence (XAI) have shown an evident and quantified lack of reliability for measuring feature-relevance when statistically entangled features are proposed for training deep classifiers. There has been an increase in the application of Deep Learning in clinical trials to predict early diagnosis of neuro-developmental disorders, such as Autism Spectrum Disorder (ASD). However, the inclusion of more reliable saliency-maps to obtain more trustworthy and interpretable metrics using neural activity features is still insufficiently mature for practical applications in diagnostics or clinical trials. Moreover, in ASD research the inclusion of deep classifiers that use neural measures to predict viewed facial emotions is relatively unexplored. Therefore, in this study we propose the evaluation of a Convolutional Neural Network (CNN) for electroencephalography (EEG)-based facial emotion recognition decoding complemented with a novel RemOve-And-Retrain (ROAR) methodology to recover highly relevant features used in the classifier. Specifically, we compare well-known relevance maps such as Layer-Wise Relevance Propagation (LRP), PatternNet, Pattern-Attribution, and Smooth-Grad Squared. This study is the first to consolidate a more transparent feature-relevance calculation for a successful EEG-based facial emotion recognition using a within-subject-trained CNN in typically-developed and ASD individuals.
    Expected hypervolume improvement for simultaneous multi-objective and multi-fidelity optimization. (arXiv:2112.13901v2 [cs.LG] UPDATED)
    (0 min) Bayesian optimization has proven to be an efficient method to optimize expensive-to-evaluate systems. However, depending on the cost of single observations, multi-dimensional optimizations of one or more objectives may still be prohibitively expensive. Multi-fidelity optimization remedies this issue by including multiple, cheaper information sources such as low-resolution approximations in numerical simulations. Acquisition functions for multi-fidelity optimization are typically based on exploration-heavy algorithms that are difficult to combine with optimization towards multiple objectives. Here we show that the expected hypervolume improvement policy can act in many situations as a suitable substitute. We incorporate the evaluation cost either via a two-step evaluation or within a single acquisition function with an additional fidelity-related objective. This permits simultaneous multi-objective and multi-fidelity optimization, which allows to accurately establish the Pareto set and front at fractional cost. Benchmarks show a cost reduction of an order of magnitude or more. Our method thus allows for Pareto optimization of extremely expansive black-box functions. The presented methods are simple and straightforward to implement in existing, optimized Bayesian optimization frameworks and can immediately be extended to batch optimization. The techniques can also be used to combine different continuous and/or discrete fidelity dimensions, which makes them particularly relevant for simulation problems in plasma physics, fluid dynamics and many other branches of scientific computing.
    SGTR: End-to-end Scene Graph Generation with Transformer. (arXiv:2112.12970v2 [cs.CV] UPDATED)
    (0 min) Scene Graph Generation (SGG) remains a challenging visual understanding task due to its complex compositional property. Most previous works adopt a bottom-up two-stage or a point-based one-stage approach, which often suffers from overhead time complexity or sub-optimal design assumption. In this work, we propose a novel SGG method to address the aforementioned issues, which formulates the task as a bipartite graph construction problem. To solve the problem, we develop a transformer-based end-to-end framework that first generates the entity and predicate proposal set, followed by inferring directed edges to form the relation triplets. In particular, we develop a new entity-aware predicate representation based on a structural predicate generator to leverage the compositional property of relationships. Moreover, we design a graph assembling module to infer the connectivity of the bipartite scene graph based on our entity-aware structure, enabling us to generate the scene graph in an end-to-end manner. Extensive experimental results show that our design is able to achieve the state-of-the-art or comparable performance on two challenging benchmarks, surpassing most of the existing approaches and enjoying higher efficiency in inference. We hope our model can serve as a strong baseline for the Transformer-based scene graph generation.
    Neural Thompson Sampling. (arXiv:2010.00827v2 [cs.LG] UPDATED)
    (0 min) Thompson Sampling (TS) is one of the most effective algorithms for solving contextual multi-armed bandit problems. In this paper, we propose a new algorithm, called Neural Thompson Sampling, which adapts deep neural networks for both exploration and exploitation. At the core of our algorithm is a novel posterior distribution of the reward, where its mean is the neural network approximator, and its variance is built upon the neural tangent features of the corresponding neural network. We prove that, provided the underlying reward function is bounded, the proposed algorithm is guaranteed to achieve a cumulative regret of $\mathcal{O}(T^{1/2})$, which matches the regret of other contextual bandit algorithms in terms of total round number $T$. Experimental comparisons with other benchmark bandit algorithms on various data sets corroborate our theory.
    Infinite wide (finite depth) Neural Networks benefit from multi-task learning unlike shallow Gaussian Processes -- an exact quantitative macroscopic characterization. (arXiv:2112.15577v1 [cs.LG])
    (0 min) We prove in this paper that wide ReLU neural networks (NNs) with at least one hidden layer optimized with l2-regularization on the parameters enforces multi-task learning due to representation-learning - also in the limit width to infinity. This is in contrast to multiple other idealized settings discussed in the literature where wide (ReLU)-NNs loose their ability to benefit from multi-task learning in the limit width to infinity. We deduce the multi-task learning ability from proving an exact quantitative macroscopic characterization of the learned NN in function space.
    An Efficient Epileptic Seizure Detection Technique using Discrete Wavelet Transform and Machine Learning Classifiers. (arXiv:2109.13811v2 [eess.SP] UPDATED)
    (0 min) This paper presents an epilepsy detection method based on discrete wavelet transform (DWT) and Machine learning classifiers. Here DWT has been used for feature extraction as it provides a better decomposition of the signals in different frequency bands. At first, DWT has been applied to the EEG signal to extract the detail and approximate coefficients or different sub-bands. After the extraction of the coefficients, principal component analysis (PCA) has been applied on different sub-bands and then a feature level fusion technique is used to extract the important features in low dimensional feature space. Three classifiers namely: Support Vector Machine (SVM) classifier, K-Nearest-Neighbor (KNN) classifier, and Naive Bayes (NB) Classifiers have been used in the proposed work for classifying the EEG signals. The proposed method is tested on Bonn databases and provides a maximum of 100% recognition accuracy for KNN, SVM, NB classifiers.
    BiTr-Unet: a CNN-Transformer Combined Network for MRI Brain Tumor Segmentation. (arXiv:2109.12271v2 [eess.IV] UPDATED)
    (0 min) Convolutional neural networks (CNNs) have achieved remarkable success in automatically segmenting organs or lesions on 3D medical images. Recently, vision transformer networks have exhibited exceptional performance in 2D image classification tasks. Compared with CNNs, transformer networks have an appealing advantage of extracting long-range features due to their self-attention algorithm. Therefore, we propose a CNN-Transformer combined model, called BiTr-Unet, with specific modifications for brain tumor segmentation on multi-modal MRI scans. Our BiTr-Unet achieves good performance on the BraTS2021 validation dataset with median Dice score 0.9335, 0.9304 and 0.8899, and median Hausdorff distance 2.8284, 2.2361 and 1.4142 for the whole tumor, tumor core, and enhancing tumor, respectively. On the BraTS2021 testing dataset, the corresponding results are 0.9257, 0.9350 and 0.8874 for Dice score, and 3, 2.2361 and 1.4142 for Hausdorff distance. The code is publicly available at https://github.com/JustaTinyDot/BiTr-Unet.
    Single-Shot Pruning for Offline Reinforcement Learning. (arXiv:2112.15579v1 [cs.LG])
    (0 min) Deep Reinforcement Learning (RL) is a powerful framework for solving complex real-world problems. Large neural networks employed in the framework are traditionally associated with better generalization capabilities, but their increased size entails the drawbacks of extensive training duration, substantial hardware resources, and longer inference times. One way to tackle this problem is to prune neural networks leaving only the necessary parameters. State-of-the-art concurrent pruning techniques for imposing sparsity perform demonstrably well in applications where data distributions are fixed. However, they have not yet been substantially explored in the context of RL. We close the gap between RL and single-shot pruning techniques and present a general pruning approach to the Offline RL. We leverage a fixed dataset to prune neural networks before the start of RL training. We then run experiments varying the network sparsity level and evaluating the validity of pruning at initialization techniques in continuous control tasks. Our results show that with 95% of the network weights pruned, Offline-RL algorithms can still retain performance in the majority of our experiments. To the best of our knowledge, no prior work utilizing pruning in RL retained performance at such high levels of sparsity. Moreover, pruning at initialization techniques can be easily integrated into any existing Offline-RL algorithms without changing the learning objective.
    Hamiltonian Dynamics with Non-Newtonian Momentum for Rapid Sampling. (arXiv:2111.02434v3 [cs.LG] UPDATED)
    (0 min) Sampling from an unnormalized probability distribution is a fundamental problem in machine learning with applications including Bayesian modeling, latent factor inference, and energy-based model training. After decades of research, variations of MCMC remain the default approach to sampling despite slow convergence. Auxiliary neural models can learn to speed up MCMC, but the overhead for training the extra model can be prohibitive. We propose a fundamentally different approach to this problem via a new Hamiltonian dynamics with a non-Newtonian momentum. In contrast to MCMC approaches like Hamiltonian Monte Carlo, no stochastic step is required. Instead, the proposed deterministic dynamics in an extended state space exactly sample the target distribution, specified by an energy function, under an assumption of ergodicity. Alternatively, the dynamics can be interpreted as a normalizing flow that samples a specified energy model without training. The proposed Energy Sampling Hamiltonian (ESH) dynamics have a simple form that can be solved with existing ODE solvers, but we derive a specialized solver that exhibits much better performance. ESH dynamics converge faster than their MCMC competitors enabling faster, more stable training of neural network energy models.
    Representation Learning via Consistent Assignment of Views to Clusters. (arXiv:2112.15421v1 [cs.LG])
    (0 min) We introduce Consistent Assignment for Representation Learning (CARL), an unsupervised learning method to learn visual representations by combining ideas from self-supervised contrastive learning and deep clustering. By viewing contrastive learning from a clustering perspective, CARL learns unsupervised representations by learning a set of general prototypes that serve as energy anchors to enforce different views of a given image to be assigned to the same prototype. Unlike contemporary work on contrastive learning with deep clustering, CARL proposes to learn the set of general prototypes in an online fashion, using gradient descent without the necessity of using non-differentiable algorithms or K-Means to solve the cluster assignment problem. CARL surpasses its competitors in many representations learning benchmarks, including linear evaluation, semi-supervised learning, and transfer learning.
    From Twitter to Traffic Predictor: Next-Day Morning Traffic Prediction Using Social Media Data. (arXiv:2009.13794v3 [cs.SI] UPDATED)
    (0 min) The effectiveness of traditional traffic prediction methods is often extremely limited when forecasting traffic dynamics in early morning. The reason is that traffic can break down drastically during the early morning commute, and the time and duration of this break-down vary substantially from day to day. Early morning traffic forecast is crucial to inform morning-commute traffic management, but they are generally challenging to predict in advance, particularly by midnight. In this paper, we propose to mine Twitter messages as a probing method to understand the impacts of people's work and rest patterns in the evening/midnight of the previous day to the next-day morning traffic. The model is tested on freeway networks in Pittsburgh as experiments. The resulting relationship is surprisingly simple and powerful. We find that, in general, the earlier people rest as indicated from Tweets, the more congested roads will be in the next morning. The occurrence of big events in the evening before, represented by higher or lower tweet sentiment than normal, often implies lower travel demand in the next morning than normal days. Besides, people's tweeting activities in the night before and early morning are statistically associated with congestion in morning peak hours. We make use of such relationships to build a predictive framework which forecasts morning commute congestion using people's tweeting profiles extracted by 5 am or as late as the midnight prior to the morning. The Pittsburgh study supports that our framework can precisely predict morning congestion, particularly for some road segments upstream of roadway bottlenecks with large day-to-day congestion variation. Our approach considerably outperforms those existing methods without Twitter message features, and it can learn meaningful representation of demand from tweeting profiles that offer managerial insights.
    PCACE: A Statistical Approach to Ranking Neurons for CNN Interpretability. (arXiv:2112.15571v1 [cs.CV])
    (0 min) In this paper we introduce a new problem within the growing literature of interpretability for convolution neural networks (CNNs). While previous work has focused on the question of how to visually interpret CNNs, we ask what it is that we care to interpret, that is, which layers and neurons are worth our attention? Due to the vast size of modern deep learning network architectures, automated, quantitative methods are needed to rank the relative importance of neurons so as to provide an answer to this question. We present a new statistical method for ranking the hidden neurons in any convolutional layer of a network. We define importance as the maximal correlation between the activation maps and the class score. We provide different ways in which this method can be used for visualization purposes with MNIST and ImageNet, and show a real-world application of our method to air pollution prediction with street-level images.
    Machine learning based disease diagnosis: A comprehensive review. (arXiv:2112.15538v1 [cs.LG])
    (0 min) Globally, there is a substantial unmet need to diagnose various diseases effectively. The complexity of the different disease mechanisms and underlying symptoms of the patient population presents massive challenges to developing the early diagnosis tool and effective treatment. Machine Learning (ML), an area of Artificial Intelligence (AI), enables researchers, physicians, and patients to solve some of these issues. Based on relevant research, this review explains how Machine Learning (ML) and Deep Learning (DL) are being used to help in the early identification of numerous diseases. To begin, a bibliometric study of the publication is given using data from the Scopus and Web of Science (WOS) databases. The bibliometric study of 1216 publications was undertaken to determine the most prolific authors, nations, organizations, and most cited articles. The review then summarizes the most recent trends and approaches in Machine Learning-based Disease Diagnosis (MLBDD), considering the following factors: algorithm, disease types, data type, application, and evaluation metrics. Finally, the paper highlights key results and provides insight into future trends and opportunities in the MLBDD area.
    BERTphone: Phonetically-Aware Encoder Representations for Utterance-Level Speaker and Language Recognition. (arXiv:1907.00457v2 [cs.CL] UPDATED)
    (0 min) We introduce BERTphone, a Transformer encoder trained on large speech corpora that outputs phonetically-aware contextual representation vectors that can be used for both speaker and language recognition. This is accomplished by training on two objectives: the first, inspired by adapting BERT to the continuous domain, involves masking spans of input frames and reconstructing the whole sequence for acoustic representation learning; the second, inspired by the success of bottleneck features from ASR, is a sequence-level CTC loss applied to phoneme labels for phonetic representation learning. We pretrain two BERTphone models (one on Fisher and one on TED-LIUM) and use them as feature extractors into x-vector-style DNNs for both tasks. We attain a state-of-the-art $C_{\text{avg}}$ of 6.16 on the challenging LRE07 3sec closed-set language recognition task. On Fisher and VoxCeleb speaker recognition tasks, we see an 18% relative reduction in speaker EER when training on BERTphone vectors instead of MFCCs. In general, BERTphone outperforms previous phonetic pretraining approaches on the same data. We release our code and models at https://github.com/awslabs/speech-representations.
    Data-driven advice for interpreting local and global model predictions in bioinformatics problems. (arXiv:2108.06201v2 [stat.ML] UPDATED)
    (0 min) Tree-based algorithms such as random forests and gradient boosted trees continue to be among the most popular and powerful machine learning models used across multiple disciplines. The conventional wisdom of estimating the impact of a feature in tree based models is to measure the \textit{node-wise reduction of a loss function}, which (i) yields only global importance measures and (ii) is known to suffer from severe biases. Conditional feature contributions (CFCs) provide \textit{local}, case-by-case explanations of a prediction by following the decision path and attributing changes in the expected output of the model to each feature along the path. However, Lundberg et al. pointed out a potential bias of CFCs which depends on the distance from the root of a tree. The by now immensely popular alternative, SHapley Additive exPlanation (SHAP) values appear to mitigate this bias but are computationally much more expensive. Here we contribute a thorough comparison of the explanations computed by both methods on a set of 164 publicly available classification problems in order to provide data-driven algorithm recommendations to current researchers. For random forests, we find extremely high similarities and correlations of both local and global SHAP values and CFC scores, leading to very similar rankings and interpretations. Analogous conclusions hold for the fidelity of using global feature importance scores as a proxy for the predictive power associated with each feature.
    Accelerated Proximal Alternating Gradient-Descent-Ascent for Nonconvex Minimax Machine Learning. (arXiv:2112.11663v3 [cs.LG] UPDATED)
    (0 min) Alternating gradient-descent-ascent (AltGDA) is an optimization algorithm that has been widely used for model training in various machine learning applications, which aim to solve a nonconvex minimax optimization problem. However, the existing studies show that it suffers from a high computation complexity in nonconvex minimax optimization. In this paper, we develop a single-loop and fast AltGDA-type algorithm that leverages proximal gradient updates and momentum acceleration to solve regularized nonconvex minimax optimization problems. By identifying the intrinsic Lyapunov function of this algorithm, we prove that it converges to a critical point of the nonconvex minimax optimization problem and achieves a computation complexity $\mathcal{O}(\kappa^{1.5}\epsilon^{-2})$, where $\epsilon$ is the desired level of accuracy and $\kappa$ is the problem's condition number. Such a computation complexity improves the state-of-the-art complexities of single-loop GDA and AltGDA algorithms (see the summary of comparison in Table 1). We demonstrate the effectiveness of our algorithm via an experiment on adversarial deep learning.
    Learning Agent State Online with Recurrent Generate-and-Test. (arXiv:2112.15236v1 [cs.LG])
    (0 min) Learning continually and online from a continuous stream of data is challenging, especially for a reinforcement learning agent with sequential data. When the environment only provides observations giving partial information about the state of the environment, the agent must learn the agent state based on the data stream of experience. We refer to the state learned directly from the data stream of experience as the agent state. Recurrent neural networks can learn the agent state, but the training methods are computationally expensive and sensitive to the hyper-parameters, making them unideal for online learning. This work introduces methods based on the generate-and-test approach to learn the agent state. A generate-and-test algorithm searches for state features by generating features and testing their usefulness. In this process, features useful for the agent's performance on the task are preserved, and the least useful features get replaced with newly generated features. We study the effectiveness of our methods on two online multi-step prediction problems. The first problem, trace conditioning, focuses on the agent's ability to remember a cue for a prediction multiple steps into the future. In the second problem, trace patterning, the agent needs to learn patterns in the observation signals and remember them for future predictions. We show that our proposed methods can effectively learn the agent state online and produce accurate predictions.
    Uniform-in-Phase-Space Data Selection with Iterative Normalizing Flows. (arXiv:2112.15446v1 [cs.LG])
    (0 min) Improvements in computational and experimental capabilities are rapidly increasing the amount of scientific data that is routinely generated. In applications that are constrained by memory and computational intensity, excessively large datasets may hinder scientific discovery, making data reduction a critical component of data-driven methods. Datasets are growing in two directions: the number of data points and their dimensionality. Whereas data compression techniques are concerned with reducing dimensionality, the focus here is on reducing the number of data points. A strategy is proposed to select data points such that they uniformly span the phase-space of the data. The algorithm proposed relies on estimating the probability map of the data and using it to construct an acceptance probability. An iterative method is used to accurately estimate the probability of the rare data points when only a small subset of the dataset is used to construct the probability map. Instead of binning the phase-space to estimate the probability map, its functional form is approximated with a normalizing flow. Therefore, the method naturally extends to high-dimensional datasets. The proposed framework is demonstrated as a viable pathway to enable data-efficient machine learning when abundant data is available. An implementation of the method is available in a companion repository (https://github.com/NREL/Phase-space-sampling).
    Triangular Flows for Generative Modeling: Statistical Consistency, Smoothness Classes, and Fast Rates. (arXiv:2112.15595v1 [stat.ML])
    (0 min) Triangular flows, also known as Kn\"{o}the-Rosenblatt measure couplings, comprise an important building block of normalizing flow models for generative modeling and density estimation, including popular autoregressive flow models such as real-valued non-volume preserving transformation models (Real NVP). We present statistical guarantees and sample complexity bounds for triangular flow statistical models. In particular, we establish the statistical consistency and the finite sample convergence rates of the Kullback-Leibler estimator of the Kn\"{o}the-Rosenblatt measure coupling using tools from empirical process theory. Our results highlight the anisotropic geometry of function classes at play in triangular flows, shed light on optimal coordinate ordering, and lead to statistical guarantees for Jacobian flows. We conduct numerical experiments on synthetic data to illustrate the practical implications of our theoretical findings.
    Improving Baselines in the Wild. (arXiv:2112.15550v1 [cs.LG])
    (0 min) We share our experience with the recently released WILDS benchmark, a collection of ten datasets dedicated to developing models and training strategies which are robust to domain shifts. Several experiments yield a couple of critical observations which we believe are of general interest for any future work on WILDS. Our study focuses on two datasets: iWildCam and FMoW. We show that (1) Conducting separate cross-validation for each evaluation metric is crucial for both datasets, (2) A weak correlation between validation and test performance might make model development difficult for iWildCam, (3) Minor changes in the training of hyper-parameters improve the baseline by a relatively large margin (mainly on FMoW), (4) There is a strong correlation between certain domains and certain target labels (mainly on iWildCam). To the best of our knowledge, no prior work on these datasets has reported these observations despite their obvious importance. Our code is public.
    Learning the Regularization in DCE-MR Image Reconstruction for Functional Imaging of Kidneys. (arXiv:2109.07548v2 [cs.LG] UPDATED)
    (0 min) Kidney DCE-MRI aims at both qualitative assessment of kidney anatomy and quantitative assessment of kidney function by estimating the tracer kinetic (TK) model parameters. Accurate estimation of TK model parameters requires an accurate measurement of the arterial input function (AIF) with high temporal resolution. Accelerated imaging is used to achieve high temporal resolution, which yields under-sampling artifacts in the reconstructed images. Compressed sensing (CS) methods offer a variety of reconstruction options. Most commonly, sparsity of temporal differences is encouraged for regularization to reduce artifacts. Increasing regularization in CS methods removes the ambient artifacts but also over-smooths the signal temporally which reduces the parameter estimation accuracy. In this work, we propose a single image trained deep neural network to reduce MRI under-sampling artifacts without reducing the accuracy of functional imaging markers. Instead of regularizing with a penalty term in optimization, we promote regularization by generating images from a lower dimensional representation. In this manuscript we motivate and explain the lower dimensional input design. We compare our approach to CS reconstructions with multiple regularization weights. Proposed approach results in kidney biomarkers that are highly correlated with the ground truth markers estimated using the CS reconstruction which was optimized for functional analysis. At the same time, the proposed approach reduces the artifacts in the reconstructed images.
    A sampling-based approach for efficient clustering in large datasets. (arXiv:2112.14793v1 [cs.LG])
    (0 min) We propose a simple and efficient clustering method for high-dimensional data with a large number of clusters. Our algorithm achieves high-performance by evaluating distances of datapoints with a subset of the cluster centres. Our contribution is substantially more efficient than k-means as it does not require an all to all comparison of data points and clusters. We show that the optimal solutions of our approximation are the same as in the exact solution. However, our approach is considerably more efficient at extracting these clusters compared to the state-of-the-art. We compare our approximation with the exact k-means and alternative approximation approaches on a series of standardised clustering tasks. For the evaluation, we consider the algorithmic complexity, including number of operations to convergence, and the stability of the results.
    NCIS: Neural Contextual Iterative Smoothing for Purifying Adversarial Perturbations. (arXiv:2106.11644v2 [cs.CV] UPDATED)
    (0 min) We propose a novel and effective purification based adversarial defense method against pre-processor blind white- and black-box attacks. Our method is computationally efficient and trained only with self-supervised learning on general images, without requiring any adversarial training or retraining of the classification model. We first show an empirical analysis on the adversarial noise, defined to be the residual between an original image and its adversarial example, has almost zero mean, symmetric distribution. Based on this observation, we propose a very simple iterative Gaussian Smoothing (GS) which can effectively smooth out adversarial noise and achieve substantially high robust accuracy. To further improve it, we propose Neural Contextual Iterative Smoothing (NCIS), which trains a blind-spot network (BSN) in a self-supervised manner to reconstruct the discriminative features of the original image that is also smoothed out by GS. From our extensive experiments on the large-scale ImageNet using four classification models, we show that our method achieves both competitive standard accuracy and state-of-the-art robust accuracy against most strong purifier-blind white- and black-box attacks. Also, we propose a new benchmark for evaluating a purification method based on commercial image classification APIs, such as AWS, Azure, Clarifai and Google. We generate adversarial examples by ensemble transfer-based black-box attack, which can induce complete misclassification of APIs, and demonstrate that our method can be used to increase adversarial robustness of APIs.
    An objective function for order preserving hierarchical clustering. (arXiv:2109.04266v2 [cs.LG] UPDATED)
    (0 min) We present an objective function for similarity based hierarchical clustering of partially ordered data that preserves the partial order. That is, if $x \le y$, and if $[x]$ and $[y]$ are the respective clusters of $x$ and $y$, then there is an order relation $\le'$ on the clusters for which $[x] \le' |y]$. The theory distinguishes itself from existing theories for clustering of ordered data in that the order relation and the similarity are combined into a bi-objective optimisation problem to obtain a hierarchical clustering seeking to satisfy both. In particular, the order relation is weighted in the range $[0,1]$, and if the similarity and the order relation are not aligned, then order preservation may have to yield in favor of clustering. Finding an optimal solution is NP-hard, so we provide a polynomial time approximation algorithm, with a relative performance guarantee of $O\!\left(\log^{3/2} \!\!\, n \right)$, based on successive applications of directed sparsest cut. We provide a demonstration on a benchmark dataset, showing that our method outperforms existing methods for order preserving hierarchical clustering with significant margin. The theory is an extension of the Dasgupta cost function for divisive hierarchical clustering.
    How Much Over-parameterization Is Sufficient to Learn Deep ReLU Networks?. (arXiv:1911.12360v4 [cs.LG] UPDATED)
    (0 min) A recent line of research on deep learning focuses on the extremely over-parameterized setting, and shows that when the network width is larger than a high degree polynomial of the training sample size $n$ and the inverse of the target error $\epsilon^{-1}$, deep neural networks learned by (stochastic) gradient descent enjoy nice optimization and generalization guarantees. Very recently, it is shown that under certain margin assumptions on the training data, a polylogarithmic width condition suffices for two-layer ReLU networks to converge and generalize (Ji and Telgarsky, 2019). However, whether deep neural networks can be learned with such a mild over-parameterization is still an open question. In this work, we answer this question affirmatively and establish sharper learning guarantees for deep ReLU networks trained by (stochastic) gradient descent. In specific, under certain assumptions made in previous work, our optimization and generalization guarantees hold with network width polylogarithmic in $n$ and $\epsilon^{-1}$. Our results push the study of over-parameterized deep neural networks towards more practical settings.
    Training and Generating Neural Networks in Compressed Weight Space. (arXiv:2112.15545v1 [cs.LG])
    (0 min) The inputs and/or outputs of some neural nets are weight matrices of other neural nets. Indirect encodings or end-to-end compression of weight matrices could help to scale such approaches. Our goal is to open a discussion on this topic, starting with recurrent neural networks for character-level language modelling whose weight matrices are encoded by the discrete cosine transform. Our fast weight version thereof uses a recurrent neural network to parameterise the compressed weights. We present experimental results on the enwik8 dataset.
    Actor Loss of Soft Actor Critic Explained. (arXiv:2112.15568v1 [cs.LG])
    (0 min) This technical report is devoted to explaining how the actor loss of soft actor critic is obtained, as well as the associated gradient estimate. It gives the necessary mathematical background to derive all the presented equations, from the theoretical actor loss to the one implemented in practice. This necessitates a comparison of the reparameterization trick used in soft actor critic with the nabla log trick, which leads to open questions regarding the most efficient method to use.
    Hierarchical forecasting with a top-down alignment of independent level forecasts. (arXiv:2103.08250v4 [stat.ML] UPDATED)
    (0 min) Hierarchical forecasting with intermittent time series is a challenge in both research and empirical studies. Extensive research focuses on improving the accuracy of each hierarchy, especially the intermittent time series at bottom levels. Then hierarchical reconciliation could be used to improve the overall performance further. In this paper, we present a \emph{hierarchical-forecasting-with-alignment} approach that treats the bottom level forecasts as mutable to ensure higher forecasting accuracy on the upper levels of the hierarchy. We employ a pure deep learning forecasting approach N-BEATS for continuous time series at the top levels and a widely used tree-based algorithm LightGBM for the intermittent time series at the bottom level. The \emph{hierarchical-forecasting-with-alignment} approach is a simple yet effective variant of the bottom-up method, accounting for biases that are difficult to observe at the bottom level. It allows suboptimal forecasts at the lower level to retain a higher overall performance. The approach in this empirical study was developed by the first author during the M5 Forecasting Accuracy competition, ranking second place. The method is also business orientated and could benefit for business strategic planning.
    A Two-Timescale Framework for Bilevel Optimization: Complexity Analysis and Application to Actor-Critic. (arXiv:2007.05170v3 [math.OC] UPDATED)
    (0 min) This paper analyzes a two-timescale stochastic algorithm framework for bilevel optimization. Bilevel optimization is a class of problems which exhibit a two-level structure, and its goal is to minimize an outer objective function with variables which are constrained to be the optimal solution to an (inner) optimization problem. We consider the case when the inner problem is unconstrained and strongly convex, while the outer problem is constrained and has a smooth objective function. We propose a two-timescale stochastic approximation (TTSA) algorithm for tackling such a bilevel problem. In the algorithm, a stochastic gradient update with a larger step size is used for the inner problem, while a projected stochastic gradient update with a smaller step size is used for the outer problem. We analyze the convergence rates for the TTSA algorithm under various settings: when the outer problem is strongly convex (resp.~weakly convex), the TTSA algorithm finds an $\mathcal{O}(K^{-2/3})$-optimal (resp.~$\mathcal{O}(K^{-2/5})$-stationary) solution, where $K$ is the total iteration number. As an application, we show that a two-timescale natural actor-critic proximal policy optimization algorithm can be viewed as a special case of our TTSA framework. Importantly, the natural actor-critic algorithm is shown to converge at a rate of $\mathcal{O}(K^{-1/4})$ in terms of the gap in expected discounted reward compared to a global optimal policy.
    Efficient Robust Training via Backward Smoothing. (arXiv:2010.01278v2 [cs.LG] UPDATED)
    (0 min) Adversarial training is so far the most effective strategy in defending against adversarial examples. However, it suffers from high computational costs due to the iterative adversarial attacks in each training step. Recent studies show that it is possible to achieve fast Adversarial Training by performing a single-step attack with random initialization. However, such an approach still lags behind state-of-the-art adversarial training algorithms on both stability and model robustness. In this work, we develop a new understanding towards Fast Adversarial Training, by viewing random initialization as performing randomized smoothing for better optimization of the inner maximization problem. Following this new perspective, we also propose a new initialization strategy, backward smoothing, to further improve the stability and model robustness over single-step robust training methods. Experiments on multiple benchmarks demonstrate that our method achieves similar model robustness as the original TRADES method while using much less training time ($\sim$3x improvement with the same training schedule).
    Refining Language Models with Compositional Explanations. (arXiv:2103.10415v3 [cs.CL] UPDATED)
    (0 min) Pre-trained language models have been successful on text classification tasks, but are prone to learning spurious correlations from biased datasets, and are thus vulnerable when making inferences in a new domain. Prior work reveals such spurious patterns via post-hoc explanation algorithms which compute the importance of input features. Further, the model is regularized to align the importance scores with human knowledge, so that the unintended model behaviors are eliminated. However, such a regularization technique lacks flexibility and coverage, since only importance scores towards a pre-defined list of features are adjusted, while more complex human knowledge such as feature interaction and pattern generalization can hardly be incorporated. In this work, we propose to refine a learned language model for a target domain by collecting human-provided compositional explanations regarding observed biases. By parsing these explanations into executable logic rules, the human-specified refinement advice from a small set of explanations can be generalized to more training examples. We additionally introduce a regularization term allowing adjustments for both importance and interaction of features to better rectify model behavior. We demonstrate the effectiveness of the proposed approach on two text classification tasks by showing improved performance in target domain as well as improved model fairness after refinement.
    Separation of scales and a thermodynamic description of feature learning in some CNNs. (arXiv:2112.15383v1 [stat.ML])
    (0 min) Deep neural networks (DNNs) are powerful tools for compressing and distilling information. Due to their scale and complexity, often involving billions of inter-dependent internal degrees of freedom, exact analysis approaches often fall short. A common strategy in such cases is to identify slow degrees of freedom that average out the erratic behavior of the underlying fast microscopic variables. Here, we identify such a separation of scales occurring in over-parameterized deep convolutional neural networks (CNNs) at the end of training. It implies that neuron pre-activations fluctuate in a nearly Gaussian manner with a deterministic latent kernel. While for CNNs with infinitely many channels these kernels are inert, for finite CNNs they adapt and learn from data in an analytically tractable manner. The resulting thermodynamic theory of deep learning yields accurate predictions on several deep non-linear CNN toy models. In addition, it provides new ways of analyzing and understanding CNNs.
    An Unsupervised Domain Adaptation Model based on Dual-module Adversarial Training. (arXiv:2112.15555v1 [cs.LG])
    (0 min) In this paper, we propose a dual-module network architecture that employs a domain discriminative feature module to encourage the domain invariant feature module to learn more domain invariant features. The proposed architecture can be applied to any model that utilizes domain invariant features for unsupervised domain adaptation to improve its ability to extract domain invariant features. We conduct experiments with the Domain-Adversarial Training of Neural Networks (DANN) model as a representative algorithm. In the training process, we supply the same input to the two modules and then extract their feature distribution and prediction results respectively. We propose a discrepancy loss to find the discrepancy of the prediction results and the feature distribution between the two modules. Through the adversarial training by maximizing the loss of their feature distribution and minimizing the discrepancy of their prediction results, the two modules are encouraged to learn more domain discriminative and domain invariant features respectively. Extensive comparative evaluations are conducted and the proposed approach outperforms the state-of-the-art in most unsupervised domain adaptation tasks.
    A Critical Review of Inductive Logic Programming Techniques for Explainable AI. (arXiv:2112.15319v1 [cs.LG])
    (0 min) Despite recent advances in modern machine learning algorithms, the opaqueness of their underlying mechanisms continues to be an obstacle in adoption. To instill confidence and trust in artificial intelligence systems, Explainable Artificial Intelligence has emerged as a response to improving modern machine learning algorithms' explainability. Inductive Logic Programming (ILP), a subfield of symbolic artificial intelligence, plays a promising role in generating interpretable explanations because of its intuitive logic-driven framework. ILP effectively leverages abductive reasoning to generate explainable first-order clausal theories from examples and background knowledge. However, several challenges in developing methods inspired by ILP need to be addressed for their successful application in practice. For example, existing ILP systems often have a vast solution space, and the induced solutions are very sensitive to noises and disturbances. This survey paper summarizes the recent advances in ILP and a discussion of statistical relational learning and neural-symbolic algorithms, which offer synergistic views to ILP. Following a critical review of the recent advances, we delineate observed challenges and highlight potential avenues of further ILP-motivated research toward developing self-explanatory artificial intelligence systems.
    Simplifying Software Defect Prediction (via the "early bird" Heuristic). (arXiv:2105.11082v2 [cs.SE] UPDATED)
    (0 min) Before researchers rush to reason across all available data or try complex methods, perhaps it is prudent to first check for simpler alternatives. Specifically, if the historical data has the most information in some small region, then perhaps a model learned from that region would suffice for the rest of the project. To support this claim, we offer a case study with 240 GitHub projects, where we find that the information in those projects "clumped" towards the earliest parts of the project. A defect prediction model learned from just the first 150 commits works as well, or better than state-of-the-art alternatives. Using just this early life cycle data, we can build models very quickly, very early in the software project life cycle. Moreover, using this method, we have shown that a simple model (with just two features) generalizes to hundreds of software projects. Based on this experience, we doubt that prior work on generalizing software engineering defect prediction models may have needlessly complicated an inherently simple process. Further, prior work that focused on later-life cycle data needs to be revisited since their conclusions were drawn from relatively uninformative regions. Replication note: all our data and scripts are online at https://github.com/snaraya7/simplifying-software-analytics
    Hybrid Adversarial Imitation Learning. (arXiv:2102.02454v10 [cs.LG] UPDATED)
    (0 min) Extrapolating beyond-demonstrator (BD) performance through the imitation learning (IL) algorithm aims to learn from and outperform the demonstrator. Most existing BDIL algorithms are performed in two stages by first inferring a reward function before learning a policy via reinforcement learning (RL). However, such two-stage BDIL algorithms suffer from high computational complexity, weak robustness, and large performance variations. In particular, a poor reward function derived in the first stage will inevitably incur severe performance loss in the second stage. In this work, we propose a hybrid adversarial imitation learning (HAIL) algorithm that is one-stage, model-free, generative-adversarial (GA) fashion and curiosity-driven. Thanks to the one-stage design, the HAIL can integrate both the reward function learning and the policy optimization into one procedure, which leads to many advantages such as low computational complexity, high robustness, and strong adaptability. More specifically, HAIL simultaneously imitates the demonstrator and explores BD performance by utilizing hybrid rewards. Extensive simulation results confirm that HAIL can achieve higher performance as compared to other similar BDIL algorithms.
    Fast Learning of MNL Model from General Partial Rankings with Application to Network Formation Modeling. (arXiv:2112.15575v1 [cs.LG])
    (0 min) Multinomial Logit (MNL) is one of the most popular discrete choice models and has been widely used to model ranking data. However, there is a long-standing technical challenge of learning MNL from many real-world ranking data: exact calculation of the MNL likelihood of \emph{partial rankings} is generally intractable. In this work, we develop a scalable method for approximating the MNL likelihood of general partial rankings in polynomial time complexity. We also extend the proposed method to learn mixture of MNL. We demonstrate that the proposed methods are particularly helpful for applications to choice-based network formation modeling, where the formation of new edges in a network is viewed as individuals making choices of their friends over a candidate set. The problem of learning mixture of MNL models from partial rankings naturally arises in such applications. And the proposed methods can be used to learn MNL models from network data without the strong assumption that temporal orders of all the edge formation are available. We conduct experiments on both synthetic and real-world network data to demonstrate that the proposed methods achieve more accurate parameter estimation and better fitness of data compared to conventional methods.
    Efficient and Reliable Overlay Networks for Decentralized Federated Learning. (arXiv:2112.15486v1 [cs.NI])
    (0 min) We propose near-optimal overlay networks based on $d$-regular expander graphs to accelerate decentralized federated learning (DFL) and improve its generalization. In DFL a massive number of clients are connected by an overlay network, and they solve machine learning problems collaboratively without sharing raw data. Our overlay network design integrates spectral graph theory and the theoretical convergence and generalization bounds for DFL. As such, our proposed overlay networks accelerate convergence, improve generalization, and enhance robustness to clients failures in DFL with theoretical guarantees. Also, we present an efficient algorithm to convert a given graph to a practical overlay network and maintaining the network topology after potential client failures. We numerically verify the advantages of DFL with our proposed networks on various benchmark tasks, ranging from image classification to language modeling using hundreds of clients.
    ViNMT: Neural Machine Translation Tookit. (arXiv:2112.15272v1 [cs.CL])
    (0 min) We present an open-source toolkit for neural machine translation (NMT). The new toolkit is mainly based on vaulted Transformer (Vaswani et al., 2017) along with many other improvements detailed below, in order to create a self-contained, simple to use, consistent and comprehensive framework for Machine Translation tasks of various domains. It is tooled to support both bilingual and multilingual translation tasks, starting from building the model from respective corpora, to inferring new predictions or packaging the model to serving-capable JIT format.
    AutoFITS: Automatic Feature Engineering for Irregular Time Series. (arXiv:2112.14806v1 [cs.LG])
    (0 min) A time series represents a set of observations collected over time. Typically, these observations are captured with a uniform sampling frequency (e.g. daily). When data points are observed in uneven time intervals the time series is referred to as irregular or intermittent. In such scenarios, the most common solution is to reconstruct the time series to make it regular, thus removing its intermittency. We hypothesise that, in irregular time series, the time at which each observation is collected may be helpful to summarise the dynamics of the data and improve forecasting performance. We study this idea by developing a novel automatic feature engineering framework, which focuses on extracting information from this point of view, i.e., when each instance is collected. We study how valuable this information is by integrating it in a time series forecasting workflow and investigate how it compares to or complements state-of-the-art methods for regular time series forecasting. In the end, we contribute by providing a novel framework that tackles feature engineering for time series from an angle previously vastly ignored. We show that our approach has the potential to further extract more information about time series that significantly improves forecasting performance.
    Investigating Pose Representations and Motion Contexts Modeling for 3D Motion Prediction. (arXiv:2112.15012v1 [cs.CV])
    (0 min) Predicting human motion from historical pose sequence is crucial for a machine to succeed in intelligent interactions with humans. One aspect that has been obviated so far, is the fact that how we represent the skeletal pose has a critical impact on the prediction results. Yet there is no effort that investigates across different pose representation schemes. We conduct an indepth study on various pose representations with a focus on their effects on the motion prediction task. Moreover, recent approaches build upon off-the-shelf RNN units for motion prediction. These approaches process input pose sequence sequentially and inherently have difficulties in capturing long-term dependencies. In this paper, we propose a novel RNN architecture termed AHMR (Attentive Hierarchical Motion Recurrent network) for motion prediction which simultaneously models local motion contexts and a global context. We further explore a geodesic loss and a forward kinematics loss for the motion prediction task, which have more geometric significance than the widely employed L2 loss. Interestingly, we applied our method to a range of articulate objects including human, fish, and mouse. Empirical results show that our approach outperforms the state-of-the-art methods in short-term prediction and achieves much enhanced long-term prediction proficiency, such as retaining natural human-like motions over 50 seconds predictions. Our codes are released.
    A General Traffic Shaping Protocol in E-Commerce. (arXiv:2112.14941v1 [cs.LG])
    (0 min) To approach different business objectives, online traffic shaping algorithms aim at improving exposures of a target set of items, such as boosting the growth of new commodities. Generally, these algorithms assume that the utility of each user-item pair can be accessed via a well-trained conversion rate prediction model. However, for real E-Commerce platforms, there are unavoidable factors preventing us from learning such an accurate model. In order to break the heavy dependence on accurate inputs of the utility, we propose a general online traffic shaping protocol for online E-Commerce applications. In our framework, we approximate the function mapping the bonus scores, which generally are the only method to influence the ranking result in the traffic shaping problem, to the numbers of exposures and purchases. Concretely, we approximate the above function by a class of the piece-wise linear function constructed on the convex hull of the explored data points. Moreover, we reformulate the online traffic shaping problem as linear programming where these piece-wise linear functions are embedded into both the objective and constraints. Our algorithm can straightforwardly optimize the linear programming in the prime space, and its solution can be simply applied by a stochastic strategy to fulfill the optimized objective and the constraints in expectation. Finally, the online A/B test shows our proposed algorithm steadily outperforms the previous industrial level traffic shaping algorithm.
    BP-Net: Cuff-less, Calibration-free, and Non-invasive Blood Pressure Estimation via a Generic Deep Convolutional Architecture. (arXiv:2112.15271v1 [cs.LG])
    (0 min) Objective: The paper focuses on development of robust and accurate processing solutions for continuous and cuff-less blood pressure (BP) monitoring. In this regard, a robust deep learning-based framework is proposed for computation of low latency, continuous, and calibration-free upper and lower bounds on the systolic and diastolic BP. Method: Referred to as the BP-Net, the proposed framework is a novel convolutional architecture that provides longer effective memory while achieving superior performance due to incorporation of casual dialated convolutions and residual connections. To utilize the real potential of deep learning in extraction of intrinsic features (deep features) and enhance the long-term robustness, the BP-Net uses raw Electrocardiograph (ECG) and Photoplethysmograph (PPG) signals without extraction of any form of hand-crafted features as it is common in existing solutions. Results: By capitalizing on the fact that datasets used in recent literature are not unified and properly defined, a benchmark dataset is constructed from the MIMIC-I and MIMIC-III databases obtained from PhysioNet. The proposed BP-Net is evaluated based on this benchmark dataset demonstrating promising performance and shows superior generalizable capacity. Conclusion: The proposed BP-Net architecture is more accurate than canonical recurrent networks and enhances the long-term robustness of the BP estimation task. Significance: The proposed BP-Net architecture addresses key drawbacks of existing BP estimation solutions, i.e., relying heavily on extraction of hand-crafted features, such as pulse arrival time (PAT), and; Lack of robustness. Finally, the constructed BP-Net dataset provides a unified base for evaluation and comparison of deep learning-based BP estimation algorithms.
    Calibrated Hyperspectral Image Reconstruction via Graph-based Self-Tuning Network. (arXiv:2112.15362v1 [eess.IV])
    (0 min) Recently, hyperspectral imaging (HSI) has attracted increasing research attention, especially for the ones based on a coded aperture snapshot spectral imaging (CASSI) system. Existing deep HSI reconstruction models are generally trained on paired data to retrieve original signals upon 2D compressed measurements given by a particular optical hardware mask in CASSI, during which the mask largely impacts the reconstruction performance and could work as a "model hyperparameter" governing on data augmentations. This mask-specific training style will lead to a hardware miscalibration issue, which sets up barriers to deploying deep HSI models among different hardware and noisy environments. To address this challenge, we introduce mask uncertainty for HSI with a complete variational Bayesian learning treatment and explicitly model it through a mask decomposition inspired by real hardware. Specifically, we propose a novel Graph-based Self-Tuning (GST) network to reason uncertainties adapting to varying spatial structures of masks among different hardware. Moreover, we develop a bilevel optimization framework to balance HSI reconstruction and uncertainty estimation, accounting for the hyperparameter property of masks. Extensive experimental results and model discussions validate the effectiveness (over 33/30 dB) of the proposed GST method under two miscalibration scenarios and demonstrate a highly competitive performance compared with the state-of-the-art well-calibrated methods. Our code and pre-trained model are available at https://github.com/Jiamian Wang/mask_uncertainty_spectral_SCI
    Multi-Agent Reinforcement Learning via Adaptive Kalman Temporal Difference and Successor Representation. (arXiv:2112.15156v1 [cs.LG])
    (0 min) Distributed Multi-Agent Reinforcement Learning (MARL) algorithms has attracted a surge of interest lately mainly due to the recent advancements of Deep Neural Networks (DNNs). Conventional Model-Based (MB) or Model-Free (MF) RL algorithms are not directly applicable to the MARL problems due to utilization of a fixed reward model for learning the underlying value function. While DNN-based solutions perform utterly well when a single agent is involved, such methods fail to fully generalize to the complexities of MARL problems. In other words, although recently developed approaches based on DNNs for multi-agent environments have achieved superior performance, they are still prone to overfiting, high sensitivity to parameter selection, and sample inefficiency. The paper proposes the Multi-Agent Adaptive Kalman Temporal Difference (MAK-TD) framework and its Successor Representation-based variant, referred to as the MAK-SR. Intuitively speaking, the main objective is to capitalize on unique characteristics of Kalman Filtering (KF) such as uncertainty modeling and online second order learning. The proposed MAK-TD/SR frameworks consider the continuous nature of the action-space that is associated with high dimensional multi-agent environments and exploit Kalman Temporal Difference (KTD) to address the parameter uncertainty. By leveraging the KTD framework, SR learning procedure is modeled into a filtering problem, where Radial Basis Function (RBF) estimators are used to encode the continuous space into feature vectors. On the other hand, for learning localized reward functions, we resort to Multiple Model Adaptive Estimation (MMAE), to deal with the lack of prior knowledge on the observation noise covariance and observation mapping function. The proposed MAK-TD/SR frameworks are evaluated via several experiments, which are implemented through the OpenAI Gym MARL benchmarks.
    Robustness and risk management via distributional dynamic programming. (arXiv:2112.15430v1 [cs.LG])
    (0 min) In dynamic programming (DP) and reinforcement learning (RL), an agent learns to act optimally in terms of expected long-term return by sequentially interacting with its environment modeled by a Markov decision process (MDP). More generally in distributional reinforcement learning (DRL), the focus is on the whole distribution of the return, not just its expectation. Although DRL-based methods produced state-of-the-art performance in RL with function approximation, they involve additional quantities (compared to the non-distributional setting) that are still not well understood. As a first contribution, we introduce a new class of distributional operators, together with a practical DP algorithm for policy evaluation, that come with a robust MDP interpretation. Indeed, our approach reformulates through an augmented state space where each state is split into a worst-case substate and a best-case substate, whose values are maximized by safe and risky policies respectively. Finally, we derive distributional operators and DP algorithms solving a new control task: How to distinguish safe from risky optimal actions in order to break ties in the space of optimal policies?
    Self Reward Design with Fine-grained Interpretability. (arXiv:2112.15034v1 [cs.LG])
    (0 min) Transparency and fairness issues in Deep Reinforcement Learning may stem from the black-box nature of deep neural networks used to learn its policy, value functions etc. This paper proposes a way to circumvent the issues through the bottom-up design of neural networks (NN) with detailed interpretability, where each neuron or layer has its own meaning and utility that corresponds to humanly understandable concept. With deliberate design, we show that lavaland problems can be solved using NN model with few parameters. Furthermore, we introduce the Self Reward Design (SRD), inspired by the Inverse Reward Design, so that our interpretable design can (1) solve the problem by pure design (although imperfectly) (2) be optimized via SRD (3) perform avoidance of unknown states by recognizing the inactivations of neurons aggregated as the activation in \(w_{unknown}\).
    Distributed Random Reshuffling over Networks. (arXiv:2112.15287v1 [math.OC])
    (0 min) In this paper, we consider the distributed optimization problem where $n$ agents, each possessing a local cost function, collaboratively minimize the average of the local cost functions over a connected network. To solve the problem, we propose a distributed random reshuffling (D-RR) algorithm that combines the classical distributed gradient descent (DGD) method and Random Reshuffling (RR). We show that D-RR inherits the superiority of RR for both smooth strongly convex and smooth nonconvex objective functions. In particular, for smooth strongly convex objective functions, D-RR achieves $\mathcal{O}(1/T^2)$ rate of convergence (here, $T$ counts the total number of iterations) in terms of the squared distance between the iterate and the unique minimizer. When the objective function is assumed to be smooth nonconvex and has Lipschitz continuous component functions, we show that D-RR drives the squared norm of gradient to $0$ at a rate of $\mathcal{O}(1/T^{2/3})$. These convergence results match those of centralized RR (up to constant factors).
    Scalable Deep Graph Clustering with Random-walk based Self-supervised Learning. (arXiv:2112.15530v1 [cs.LG])
    (0 min) Web-based interactions can be frequently represented by an attributed graph, and node clustering in such graphs has received much attention lately. Multiple efforts have successfully applied Graph Convolutional Networks (GCN), though with some limits on accuracy as GCNs have been shown to suffer from over-smoothing issues. Though other methods (particularly those based on Laplacian Smoothing) have reported better accuracy, a fundamental limitation of all the work is a lack of scalability. This paper addresses this open problem by relating the Laplacian smoothing to the Generalized PageRank and applying a random-walk based algorithm as a scalable graph filter. This forms the basis for our scalable deep clustering algorithm, RwSL, where through a self-supervised mini-batch training mechanism, we simultaneously optimize a deep neural network for sample-cluster assignment distribution and an autoencoder for a clustering-oriented embedding. Using 6 real-world datasets and 6 clustering metrics, we show that RwSL achieved improved results over several recent baselines. Most notably, we show that RwSL, unlike all other deep clustering frameworks, can continue to scale beyond graphs with more than one million nodes, i.e., handle web-scale. We also demonstrate how RwSL could perform node clustering on a graph with 1.8 billion edges using only a single GPU.
    Shift-Equivariant Similarity-Preserving Hypervector Representations of Sequences. (arXiv:2112.15475v1 [cs.AI])
    (0 min) Hyperdimensional Computing (HDC), also known as Vector-Symbolic Architectures (VSA), is a promising framework for the development of cognitive architectures and artificial intelligence systems, as well as for technical applications and emerging neuromorphic and nanoscale hardware. HDC/VSA operate with hypervectors, i.e., distributed vector representations of large fixed dimension (usually > 1000). One of the key ingredients of HDC/VSA are the methods for encoding data of various types (from numeric scalars and vectors to graphs) into hypervectors. In this paper, we propose an approach for the formation of hypervectors of sequences that provides both an equivariance with respect to the shift of sequences and preserves the similarity of sequences with identical elements at nearby positions. Our methods represent the sequence elements by compositional hypervectors and exploit permutations of hypervectors for representing the order of sequence elements. We experimentally explored the proposed representations using a diverse set of tasks with data in the form of symbolic strings. Although our approach is feature-free as it forms the hypervector of a sequence from the hypervectors of its symbols at their positions, it demonstrated the performance on a par with the methods that apply various features, such as subsequences. The proposed techniques were designed for the HDC/VSA model known as Sparse Binary Distributed Representations. However, they can be adapted to hypervectors in formats of other HDC/VSA models, as well as for representing sequences of types other than symbolic strings.
    Modelling of Bi-directional Spatio-Temporal Dependence and Users' Dynamic Preferences for Missing POI Check-in Identification. (arXiv:2112.15285v1 [cs.LG])
    (0 min) Human mobility data accumulated from Point-of-Interest (POI) check-ins provides great opportunity for user behavior understanding. However, data quality issues (e.g., geolocation information missing, unreal check-ins, data sparsity) in real-life mobility data limit the effectiveness of existing POI-oriented studies, e.g., POI recommendation and location prediction, when applied to real applications. To this end, in this paper, we develop a model, named Bi-STDDP, which can integrate bi-directional spatio-temporal dependence and users' dynamic preferences, to identify the missing POI check-in where a user has visited at a specific time. Specifically, we first utilize bi-directional global spatial and local temporal information of POIs to capture the complex dependence relationships. Then, target temporal pattern in combination with user and POI information are fed into a multi-layer network to capture users' dynamic preferences. Moreover, the dynamic preferences are transformed into the same space as the dependence relationships to form the final model. Finally, the proposed model is evaluated on three large-scale real-world datasets and the results demonstrate significant improvements of our model compared with state-of-the-art methods. Also, it is worth noting that the proposed model can be naturally extended to address POI recommendation and location prediction tasks with competitive performances.
    Exploring the pattern of Emotion in children with ASD as an early biomarker through Recurring-Convolution Neural Network (R-CNN). (arXiv:2112.14983v1 [cs.CV])
    (0 min) Autism Spectrum Disorder (ASD) is found to be a major concern among various occupational therapists. The foremost challenge of this neurodevelopmental disorder lies in the fact of analyzing and exploring various symptoms of the children at their early stage of development. Such early identification could prop up the therapists and clinicians to provide proper assistive support to make the children lead an independent life. Facial expressions and emotions perceived by the children could contribute to such early intervention of autism. In this regard, the paper implements in identifying basic facial expression and exploring their emotions upon a time variant factor. The emotions are analyzed by incorporating the facial expression identified through CNN using 68 landmark points plotted on the frontal face with a prediction network formed by RNN known as RCNN-FER system. The paper adopts R-CNN to take the advantage of increased accuracy and performance with decreased time complexity in predicting emotion as a textual network analysis. The papers proves better accuracy in identifying the emotion in autistic children when compared over simple machine learning models built for such identifications contributing to autistic society.
    Derivative-Free Policy Optimization for Linear Risk-Sensitive and Robust Control Design: Implicit Regularization and Sample Complexity. (arXiv:2101.01041v3 [math.OC] UPDATED)
    (0 min) Direct policy search serves as one of the workhorses in modern reinforcement learning (RL), and its applications in continuous control tasks have recently attracted increasing attention. In this work, we investigate the convergence theory of policy gradient (PG) methods for learning the linear risk-sensitive and robust controller. In particular, we develop PG methods that can be implemented in a derivative-free fashion by sampling system trajectories, and establish both global convergence and sample complexity results in the solutions of two fundamental settings in risk-sensitive and robust control: the finite-horizon linear exponential quadratic Gaussian, and the finite-horizon linear-quadratic disturbance attenuation problems. As a by-product, our results also provide the first sample complexity for the global convergence of PG methods on solving zero-sum linear-quadratic dynamic games, a nonconvex-nonconcave minimax optimization problem that serves as a baseline setting in multi-agent reinforcement learning (MARL) with continuous spaces. One feature of our algorithms is that during the learning phase, a certain level of robustness/risk-sensitivity of the controller is preserved, which we termed as the implicit regularization property, and is an essential requirement in safety-critical control systems.
    ChunkFormer: Learning Long Time Series with Multi-stage Chunked Transformer. (arXiv:2112.15087v1 [cs.LG])
    (0 min) The analysis of long sequence data remains challenging in many real-world applications. We propose a novel architecture, ChunkFormer, that improves the existing Transformer framework to handle the challenges while dealing with long time series. Original Transformer-based models adopt an attention mechanism to discover global information along a sequence to leverage the contextual data. Long sequential data traps local information such as seasonality and fluctuations in short data sequences. In addition, the original Transformer consumes more resources by carrying the entire attention matrix during the training course. To overcome these challenges, ChunkFormer splits the long sequences into smaller sequence chunks for the attention calculation, progressively applying different chunk sizes in each stage. In this way, the proposed model gradually learns both local and global information without changing the total length of the input sequences. We have extensively tested the effectiveness of this new architecture on different business domains and have proved the advantage of such a model over the existing Transformer-based models.
    Dynamic Causal Effects Evaluation in A/B Testing with a Reinforcement Learning Framework. (arXiv:2002.01711v5 [cs.LG] UPDATED)
    (0 min) A/B testing, or online experiment is a standard business strategy to compare a new product with an old one in pharmaceutical, technological, and traditional industries. Major challenges arise in online experiments of two-sided marketplace platforms (e.g., Uber) where there is only one unit that receives a sequence of treatments over time. In those experiments, the treatment at a given time impacts current outcome as well as future outcomes. The aim of this paper is to introduce a reinforcement learning framework for carrying A/B testing in these experiments, while characterizing the long-term treatment effects. Our proposed testing procedure allows for sequential monitoring and online updating. It is generally applicable to a variety of treatment designs in different industries. In addition, we systematically investigate the theoretical properties (e.g., size and power) of our testing procedure. Finally, we apply our framework to both simulated data and a real-world data example obtained from a technological company to illustrate its advantage over the current practice. A Python implementation of our test is available at https://github.com/callmespring/CausalRL.
    Adversarial Learning for Incentive Optimization in Mobile Payment Marketing. (arXiv:2112.15434v1 [cs.LG])
    (0 min) Many payment platforms hold large-scale marketing campaigns, which allocate incentives to encourage users to pay through their applications. To maximize the return on investment, incentive allocations are commonly solved in a two-stage procedure. After training a response estimation model to estimate the users' mobile payment probabilities (MPP), a linear programming process is applied to obtain the optimal incentive allocation. However, the large amount of biased data in the training set, generated by the previous biased allocation policy, causes a biased estimation. This bias deteriorates the performance of the response model and misleads the linear programming process, dramatically degrading the performance of the resulting allocation policy. To overcome this obstacle, we propose a bias correction adversarial network. Our method leverages the small set of unbiased data obtained under a full-randomized allocation policy to train an unbiased model and then uses it to reduce the bias with adversarial learning. Offline and online experimental results demonstrate that our method outperforms state-of-the-art approaches and significantly improves the performance of the resulting allocation policy in a real-world marketing campaign.
    Studying the Interplay between Information Loss and Operation Loss in Representations for Classification. (arXiv:2112.15238v1 [cs.LG])
    (0 min) Information-theoretic measures have been widely adopted in the design of features for learning and decision problems. Inspired by this, we look at the relationship between i) a weak form of information loss in the Shannon sense and ii) the operation loss in the minimum probability of error (MPE) sense when considering a family of lossy continuous representations (features) of a continuous observation. We present several results that shed light on this interplay. Our first result offers a lower bound on a weak form of information loss as a function of its respective operation loss when adopting a discrete lossy representation (quantization) instead of the original raw observation. From this, our main result shows that a specific form of vanishing information loss (a weak notion of asymptotic informational sufficiency) implies a vanishing MPE loss (or asymptotic operational sufficiency) when considering a general family of lossy continuous representations. Our theoretical findings support the observation that the selection of feature representations that attempt to capture informational sufficiency is appropriate for learning, but this selection is a rather conservative design principle if the intended goal is achieving MPE in classification. Supporting this last point, and under some structural conditions, we show that it is possible to adopt an alternative notion of informational sufficiency (strictly weaker than pure sufficiency in the mutual information sense) to achieve operational sufficiency in learning.
    PINNs for the Solution of the Hyperbolic Buckley-Leverett Problem with a Non-convex Flux Function. (arXiv:2112.14826v1 [physics.flu-dyn])
    (0 min) The displacement of two immiscible fluids is a common problem in fluid flow in porous media. Such a problem can be posed as a partial differential equation (PDE) in what is commonly referred to as a Buckley-Leverett (B-L) problem. The B-L problem is a non-linear hyperbolic conservation law that is known to be notoriously difficult to solve using traditional numerical methods. Here, we address the forward hyperbolic B-L problem with a nonconvex flux function using physics-informed neural networks (PINNs). The contributions of this paper are twofold. First, we present a PINN approach to solve the hyperbolic B-L problem by embedding the Oleinik entropy condition into the neural network residual. We do not use a diffusion term (artificial viscosity) in the residual-loss, but we rely on the strong form of the PDE. Second, we use the Adam optimizer with residual-based adaptive refinement (RAR) algorithm to achieve an ultra-low loss without weighting. Our solution method can accurately capture the shock-front and produce an accurate overall solution. We report a L2 validation error of 2 x 10-2 and a L2 loss of 1x 10-6. The proposed method does not require any additional regularization or weighting of losses to obtain such accurate solution.
    SplitBrain: Hybrid Data and Model Parallel Deep Learning. (arXiv:2112.15317v1 [cs.LG])
    (0 min) The recent success of deep learning applications has coincided with those widely available powerful computational resources for training sophisticated machine learning models with huge datasets. Nonetheless, training large models such as convolutional neural networks using model parallelism (as opposed to data parallelism) is challenging because the complex nature of communication between model shards makes it difficult to partition the computation efficiently across multiple machines with an acceptable trade-off. This paper presents SplitBrain, a high performance distributed deep learning framework supporting hybrid data and model parallelism. Specifically, SplitBrain provides layer-specific partitioning that co-locates compute intensive convolutional layers while sharding memory demanding layers. A novel scalable group communication is proposed to further improve the training throughput with reduced communication overhead. The results show that SplitBrain can achieve nearly linear speedup while saving up to 67\% of memory consumption for data and model parallel VGG over CIFAR-10.
    Deconfounded Training for Graph Neural Networks. (arXiv:2112.15089v1 [cs.LG])
    (0 min) Learning powerful representations is one central theme of graph neural networks (GNNs). It requires refining the critical information from the input graph, instead of the trivial patterns, to enrich the representations. Towards this end, graph attention and pooling methods prevail. They mostly follow the paradigm of "learning to attend". It maximizes the mutual information between the attended subgraph and the ground-truth label. However, this training paradigm is prone to capture the spurious correlations between the trivial subgraph and the label. Such spurious correlations are beneficial to in-distribution (ID) test evaluations, but cause poor generalization in the out-of-distribution (OOD) test data. In this work, we revisit the GNN modeling from the causal perspective. On the top of our causal assumption, the trivial information serves as a confounder between the critical information and the label, which opens a backdoor path between them and makes them spuriously correlated. Hence, we present a new paradigm of deconfounded training (DTP) that better mitigates the confounding effect and latches on the critical information, to enhance the representation and generalization ability. Specifically, we adopt the attention modules to disentangle the critical subgraph and trivial subgraph. Then we make each critical subgraph fairly interact with diverse trivial subgraphs to achieve a stable prediction. It allows GNNs to capture a more reliable subgraph whose relation with the label is robust across different distributions. We conduct extensive experiments on synthetic and real-world datasets to demonstrate the effectiveness.
    Music-to-Dance Generation with Optimal Transport. (arXiv:2112.01806v1 [cs.SD] CROSS LISTED)
    (0 min) Dance choreography for a piece of music is a challenging task, having to be creative in presenting distinctive stylistic dance elements while taking into account the musical theme and rhythm. It has been tackled by different approaches such as similarity retrieval, sequence-to-sequence modeling and generative adversarial networks, but their generated dance sequences are often short of motion realism, diversity and music consistency. In this paper, we propose a Music-to-Dance with Optimal Transport Network (MDOT-Net) for learning to generate 3D dance choreographs from music. We introduce an optimal transport distance for evaluating the authenticity of the generated dance distribution and a Gromov-Wasserstein distance to measure the correspondence between the dance distribution and the input music. This gives a well defined and non-divergent training objective that mitigates the limitation of standard GAN training which is frequently plagued with instability and divergent generator loss issues. Extensive experiments demonstrate that our MDOT-Net can synthesize realistic and diverse dances which achieve an organic unity with the input music, reflecting the shared intentionality and matching the rhythmic articulation.
    Machine Learning Application Development: Practitioners' Insights. (arXiv:2112.15277v1 [cs.SE])
    (0 min) Nowadays, intelligent systems and services are getting increasingly popular as they provide data-driven solutions to diverse real-world problems, thanks to recent breakthroughs in Artificial Intelligence (AI) and Machine Learning (ML). However, machine learning meets software engineering not only with promising potentials but also with some inherent challenges. Despite some recent research efforts, we still do not have a clear understanding of the challenges of developing ML-based applications and the current industry practices. Moreover, it is unclear where software engineering researchers should focus their efforts to better support ML application developers. In this paper, we report about a survey that aimed to understand the challenges and best practices of ML application development. We synthesize the results obtained from 80 practitioners (with diverse skills, experience, and application domains) into 17 findings; outlining challenges and best practices for ML application development. Practitioners involved in the development of ML-based software systems can leverage the summarized best practices to improve the quality of their system. We hope that the reported challenges will inform the research community about topics that need to be investigated to improve the engineering process and the quality of ML-based applications.
    A Unified DRO View of Multi-class Loss Functions with top-N Consistency. (arXiv:2112.14869v1 [cs.LG])
    (0 min) Multi-class classification is one of the most common tasks in machine learning applications, where data is labeled by one of many class labels. Many loss functions have been proposed for multi-class classification including two well-known ones, namely the cross-entropy (CE) loss and the crammer-singer (CS) loss (aka. the SVM loss). While CS loss has been used widely for traditional machine learning tasks, CE loss is usually a default choice for multi-class deep learning tasks. There are also top-$k$ variants of CS loss and CE loss that are proposed to promote the learning of a classifier for achieving better top-$k$ accuracy. Nevertheless, it still remains unclear the relationship between these different losses, which hinders our understanding of their expectations in different scenarios. In this paper, we present a unified view of the CS/CE losses and their smoothed top-$k$ variants by proposing a new family of loss functions, which are arguably better than the CS/CE losses when the given label information is incomplete and noisy. The new family of smooth loss functions named {label-distributionally robust (LDR) loss} is defined by leveraging the distributionally robust optimization (DRO) framework to model the uncertainty in the given label information, where the uncertainty over true class labels is captured by using distributional weights for each label regularized by a function.
    Deep Transfer-Learning for patient specific model re-calibration: Application to sEMG-Classification. (arXiv:2112.15019v1 [cs.LG])
    (0 min) Accurate decoding of surface electromyography (sEMG) is pivotal for muscle-to-machine-interfaces (MMI) and their application for e.g. rehabilitation therapy. sEMG signals have high inter-subject variability, due to various factors, including skin thickness, body fat percentage, and electrode placement. Therefore, obtaining high generalization quality of a trained sEMG decoder is quite challenging. Usually, machine learning based sEMG decoders are either trained on subject-specific data, or at least recalibrated for each user, individually. Even though, deep learning algorithms produced several state of the art results for sEMG decoding,however, due to the limited amount of availability of sEMG data, the deep learning models are prone to overfitting. Recently, transfer learning for domain adaptation improved generalization quality with reduced training time on various machine learning tasks. In this study, we investigate the effectiveness of transfer learning using weight initialization for recalibration of two different pretrained deep learning models on a new subjects data, and compare their performance to subject-specific models. To the best of our knowledge, this is the first study that thoroughly investigated weight-initialization based transfer learning for sEMG classification and compared transfer learning to subject-specific modeling. We tested our models on three publicly available databases under various settings. On average over all settings, our transfer learning approach improves 5~\%-points on the pretrained models without fine-tuning and 12~\%-points on the subject-specific models, while being trained on average 22~\% fewer epochs. Our results indicate that transfer learning enables faster training on fewer samples than user-specific models, and improves the performance of pretrained models as long as enough data is available.
    Neural Hierarchical Factorization Machines for User's Event Sequence Analysis. (arXiv:2112.15292v1 [cs.LG])
    (0 min) Many prediction tasks of real-world applications need to model multi-order feature interactions in user's event sequence for better detection performance. However, existing popular solutions usually suffer two key issues: 1) only focusing on feature interactions and failing to capture the sequence influence; 2) only focusing on sequence information, but ignoring internal feature relations of each event, thus failing to extract a better event representation. In this paper, we consider a two-level structure for capturing the hierarchical information over user's event sequence: 1) learning effective feature interactions based event representation; 2) modeling the sequence representation of user's historical events. Experimental results on both industrial and public datasets clearly demonstrate that our model achieves significantly better performance compared with state-of-the-art baselines.
    Crowd-sensing Enhanced Parking Patrol using Trajectories of Sharing Bikes. (arXiv:2110.15557v2 [cs.LG] UPDATED)
    (0 min) Illegal vehicle parking is a common urban problem faced by major cities in the world, as it incurs traffic jams, which lead to air pollution and traffic accidents. The government highly relies on active human efforts to detect illegal parking events. However, such an approach is extremely ineffective to cover a large city since the police have to patrol over the entire city roads. The massive and high-quality sharing bike trajectories from Mobike offer us a unique opportunity to design a ubiquitous illegal parking detection approach, as most of the illegal parking events happen at curbsides and have significant impact on the bike users. The detection result can guide the patrol schedule, i.e. send the patrol policemen to the region with higher illegal parking risks, and further improve the patrol efficiency. Inspired by this idea, three main components are employed in the proposed framework: 1)~{\em trajectory pre-processing}, which filters outlier GPS points, performs map-matching, and builds trajectory indexes; 2)~{\em illegal parking detection}, which models the normal trajectories, extracts features from the evaluation trajectories, and utilizes a distribution test-based method to discover the illegal parking events; and 3)~{\em patrol scheduling}, which leverages the detection result as reference context, and models the scheduling task as a multi-agent reinforcement learning problem to guide the patrol police. Finally, extensive experiments are presented to validate the effectiveness of illegal parking detection, as well as the improvement of patrol efficiency.
    Towards Robustness of Neural Networks. (arXiv:2112.15188v1 [cs.CV])
    (0 min) We introduce several new datasets namely ImageNet-A/O and ImageNet-R as well as a synthetic environment and testing suite we called CAOS. ImageNet-A/O allow researchers to focus in on the blind spots remaining in ImageNet. ImageNet-R was specifically created with the intention of tracking robust representation as the representations are no longer simply natural but include artistic, and other renditions. The CAOS suite is built off of CARLA simulator which allows for the inclusion of anomalous objects and can create reproducible synthetic environment and scenes for testing robustness. All of the datasets were created for testing robustness and measuring progress in robustness. The datasets have been used in various other works to measure their own progress in robustness and allowing for tangential progress that does not focus exclusively on natural accuracy. Given these datasets, we created several novel methods that aim to advance robustness research. We build off of simple baselines in the form of Maximum Logit, and Typicality Score as well as create a novel data augmentation method in the form of DeepAugment that improves on the aforementioned benchmarks. Maximum Logit considers the logit values instead of the values after the softmax operation, while a small change produces noticeable improvements. The Typicality Score compares the output distribution to a posterior distribution over classes. We show that this improves performance over the baseline in all but the segmentation task. Speculating that perhaps at the pixel level the semantic information of a pixel is less meaningful than that of class level information. Finally the new augmentation technique of DeepAugment utilizes neural networks to create augmentations on images that are radically different than the traditional geometric and camera based transformations used previously.
    G-PATE: Scalable Differentially Private Data Generator via Private Aggregation of Teacher Discriminators. (arXiv:1906.09338v2 [cs.LG] UPDATED)
    (0 min) Recent advances in machine learning have largely benefited from the massive accessible training data. However, large-scale data sharing has raised great privacy concerns. In this work, we propose a novel privacy-preserving data Generative model based on the PATE framework (G-PATE), aiming to train a scalable differentially private data generator that preserves high generated data utility. Our approach leverages generative adversarial nets to generate data, combined with private aggregation among different discriminators to ensure strong privacy guarantees. Compared to existing approaches, G-PATE significantly improves the use of privacy budgets. In particular, we train a student data generator with an ensemble of teacher discriminators and propose a novel private gradient aggregation mechanism to ensure differential privacy on all information that flows from teacher discriminators to the student generator. In addition, with random projection and gradient discretization, the proposed gradient aggregation mechanism is able to effectively deal with high-dimensional gradient vectors. Theoretically, we prove that G-PATE ensures differential privacy for the data generator. Empirically, we demonstrate the superiority of G-PATE over prior work through extensive experiments. We show that G-PATE is the first work being able to generate high-dimensional image data with high data utility under limited privacy budgets ($\epsilon \le 1$). Our code is available at https://github.com/AI-secure/G-PATE.
    Speedup deep learning models on GPU by taking advantage of efficient unstructured pruning and bit-width reduction. (arXiv:2112.15445v1 [cs.LG])
    (0 min) This work is focused on the pruning of some convolutional neural networks (CNNs) and improving theirs efficiency on graphic processing units (GPU) by using a direct sparse algorithm. The Nvidia deep neural network (cuDnn) library is the most effective implementations of deep learning (DL) algorithms for GPUs. GPUs are the most commonly used accelerators for deep learning computations. One of the most common techniques for improving the efficiency of CNN models is weight pruning and quantization. There are two main types of pruning: structural and non-structural. The first enables much easier acceleration on many type of accelerators, but with this type it is difficult to achieve a sparsity level and accuracy as high as that obtained with the second type. Non-structural pruning with retraining can generate a weight tensors up to 90% or more of sparsity in some deep CNN models. In this article the pruning algorithm is presented which makes it possible to achieve high sparsity levels without accuracy drop. In the next stage the linear and non-linear quantization is adapted for further time and footprint reduction. This paper is an extended of previously published paper concerning effective pruning techniques and present real models pruned with high sparsities and reduced precision which can achieve better performance than the CuDnn library.
    Memory AMP. (arXiv:2012.10861v5 [cs.IT] UPDATED)
    (0 min) Approximate message passing (AMP) is a low-cost iterative parameter-estimation technique for certain high-dimensional linear systems with non-Gaussian distributions. However, AMP only applies to independent identically distributed (IID) transform matrices, but may become unreliable (e.g., perform poorly or even diverge) for other matrix ensembles, especially for ill-conditioned ones. Orthogonal/vector AMP (OAMP/VAMP) was proposed for general right-unitarily-invariant matrices to handle this difficulty. However, the Bayes-optimal OAMP/VAMP (BO-OAMP/VAMP) requires a high-complexity linear minimum mean square error (MMSE) estimator. This limits the application of OAMP/VAMP to large-scale systems. To solve the disadvantages of AMP and BO-OAMP/VAMP, this paper proposes a memory AMP (MAMP) framework under an orthogonality principle, which guarantees the asymptotic IID Gaussianity of estimation errors in MAMP. We present an orthogonalization procedure for the local memory estimators to realize the required orthogonality for MAMP. Furthermore, we propose a Bayes-optimal MAMP (BO-MAMP), in which a long-memory matched filter is proposed for interference suppression. The complexity of BO-MAMP is comparable to AMP. A state evolution is derived to asymptotically characterize the performance of BO-MAMP. Based on state evolution, the relaxation parameters and damping vector in BO-MAMP are optimized. For all right-unitarily-invariant matrices, the state evolution of the optimized BO-MAMP converges to the same fixed point as that of the high-complexity BO-OAMP/VAMP and is Bayes-optimal if its state evolution has a unique fixed point. Finally, simulations are provided to verify the validity and accuracy of the theoretical results.
    Uniform-PAC Bounds for Reinforcement Learning with Linear Function Approximation. (arXiv:2106.11612v2 [cs.LG] UPDATED)
    (0 min) We study reinforcement learning (RL) with linear function approximation. Existing algorithms for this problem only have high-probability regret and/or Probably Approximately Correct (PAC) sample complexity guarantees, which cannot guarantee the convergence to the optimal policy. In this paper, in order to overcome the limitation of existing algorithms, we propose a new algorithm called FLUTE, which enjoys uniform-PAC convergence to the optimal policy with high probability. The uniform-PAC guarantee is the strongest possible guarantee for reinforcement learning in the literature, which can directly imply both PAC and high probability regret bounds, making our algorithm superior to all existing algorithms with linear function approximation. At the core of our algorithm is a novel minimax value function estimator and a multi-level partition scheme to select the training samples from historical observations. Both of these techniques are new and of independent interest.
    Investigating and Modeling the Dynamics of Long Ties. (arXiv:2109.10523v2 [cs.SI] UPDATED)
    (0 min) Long ties, the social ties that bridge different communities, are widely believed to play crucial roles in spreading novel information in social networks. However, some existing network theories and prediction models indicate that long ties might dissolve quickly or eventually become redundant, thus putting into question the long-term value of long ties. Our empirical analysis of real-world dynamic networks shows that contrary to such reasoning, long ties are more likely to persist than other social ties, and that many of them constantly function as social bridges without being embedded in local networks. Using a novel cost-benefit analysis model combined with machine learning, we show that long ties are highly beneficial, which instinctively motivates people to expend extra effort to maintain them. This partly explains why long ties are more persistent than what has been suggested by many existing theories and models. Overall, our study suggests the need for social interventions that can promote the formation of long ties, such as mixing people with diverse backgrounds.
    Fairness-Oriented User Scheduling for Bursty Downlink Transmission Using Multi-Agent Reinforcement Learning. (arXiv:2012.15081v13 [cs.OS] UPDATED)
    (0 min) In this work, we develop practical user scheduling algorithms for downlink bursty traffic with emphasis on user fairness. In contrast to the conventional scheduling algorithms that either equally divides the transmission time slots among users or maximizing some ratios without physcial meanings, we propose to use the 5%-tile user data rate (5TUDR) as the metric to evaluate user fairness. Since it is difficult to directly optimize 5TUDR, we first cast the problem into the stochastic game framework and subsequently propose a Multi-Agent Reinforcement Learning (MARL)-based algorithm to perform distributed optimization on the resource block group (RBG) allocation. Furthermore, each MARL agent is designed to take information measured by network counters from multiple network layers (e.g. Channel Quality Indicator, Buffer size) as the input states while the RBG allocation as action with a proposed reward function designed to maximize 5TUDR. Extensive simulation is performed to show that the proposed MARL-based scheduler can achieve fair scheduling while maintaining good average network throughput as compared to conventional schedulers.
    Motif Graph Neural Network. (arXiv:2112.14900v1 [cs.LG])
    (0 min) Graphs can model complicated interactions between entities, which naturally emerge in many important applications. These applications can often be cast into standard graph learning tasks, in which a crucial step is to learn low-dimensional graph representations. Graph neural networks (GNNs) are currently the most popular model in graph embedding approaches. However, standard GNNs in the neighborhood aggregation paradigm suffer from limited discriminative power in distinguishing \emph{high-order} graph structures as opposed to \emph{low-order} structures. To capture high-order structures, researchers have resorted to motifs and developed motif-based GNNs. However, existing motif-based GNNs still often suffer from less discriminative power on high-order structures. To overcome the above limitations, we propose Motif Graph Neural Network (MGNN), a novel framework to better capture high-order structures, hinging on our proposed motif redundancy minimization operator and injective motif combination. First, MGNN produces a set of node representations w.r.t. each motif. The next phase is our proposed redundancy minimization among motifs which compares the motifs with each other and distills the features unique to each motif. Finally, MGNN performs the updating of node representations by combining multiple representations from different motifs. In particular, to enhance the discriminative power, MGNN utilizes an injective function to combine the representations w.r.t. different motifs. We further show that our proposed architecture increases the expressive power of GNNs with a theoretical analysis. We demonstrate that MGNN outperforms state-of-the-art methods on seven public benchmarks on both node classification and graph classification tasks.
    SPViT: Enabling Faster Vision Transformers via Soft Token Pruning. (arXiv:2112.13890v1 [cs.CV] CROSS LISTED)
    (0 min) Recently, Vision Transformer (ViT) has continuously established new milestones in the computer vision field, while the high computation and memory cost makes its propagation in industrial production difficult. Pruning, a traditional model compression paradigm for hardware efficiency, has been widely applied in various DNN structures. Nevertheless, it stays ambiguous on how to perform exclusive pruning on the ViT structure. Considering three key points: the structural characteristics, the internal data pattern of ViTs, and the related edge device deployment, we leverage the input token sparsity and propose a computation-aware soft pruning framework, which can be set up on vanilla Transformers of both flatten and CNN-type structures, such as Pooling-based ViT (PiT). More concretely, we design a dynamic attention-based multi-head token selector, which is a lightweight module for adaptive instance-wise token selection. We further introduce a soft pruning technique, which integrates the less informative tokens generated by the selector module into a package token that will participate in subsequent calculations rather than being completely discarded. Our framework is bound to the trade-off between accuracy and computation constraints of specific edge devices through our proposed computation-aware training strategy. Experimental results show that our framework significantly reduces the computation cost of ViTs while maintaining comparable performance on image classification. Moreover, our framework can guarantee the identified model to meet resource specifications of mobile devices and FPGA, and even achieve the real-time execution of DeiT-T on mobile platforms. For example, our method reduces the latency of DeiT-T to 26 ms (26%$\sim $41% superior to existing works) on the mobile device with 0.25%$\sim $4% higher top-1 accuracy on ImageNet. Our code will be released soon.
    Weakly Supervised Change Detection Using Guided Anisotropic Difusion. (arXiv:2112.15367v1 [eess.IV])
    (0 min) Large scale datasets created from crowdsourced labels or openly available data have become crucial to provide training data for large scale learning algorithms. While these datasets are easier to acquire, the data are frequently noisy and unreliable, which is motivating research on weakly supervised learning techniques. In this paper we propose original ideas that help us to leverage such datasets in the context of change detection. First, we propose the guided anisotropic diffusion (GAD) algorithm, which improves semantic segmentation results using the input images as guides to perform edge preserving filtering. We then show its potential in two weakly-supervised learning strategies tailored for change detection. The first strategy is an iterative learning method that combines model optimisation and data cleansing using GAD to extract the useful information from a large scale change detection dataset generated from open vector data. The second one incorporates GAD within a novel spatial attention layer that increases the accuracy of weakly supervised networks trained to perform pixel-level predictions from image-level labels. Improvements with respect to state-of-the-art are demonstrated on 4 different public datasets.
    Recent Trends in Artificial Intelligence-inspired Electronic Thermal Management. (arXiv:2112.14837v1 [cs.LG])
    (0 min) The rise of computation-based methods in thermal management has gained immense attention in recent years due to the ability of deep learning to solve complex 'physics' problems, which are otherwise difficult to be approached using conventional techniques. Thermal management is required in electronic systems to keep them from overheating and burning, enhancing their efficiency and lifespan. For a long time, numerical techniques have been employed to aid in the thermal management of electronics. However, they come with some limitations. To increase the effectiveness of traditional numerical approaches and address the drawbacks faced in conventional approaches, researchers have looked at using artificial intelligence at various stages of the thermal management process. The present study discusses in detail, the current uses of deep learning in the domain of 'electronic' thermal management.
    What is Event Knowledge Graph: A Survey. (arXiv:2112.15280v1 [cs.LG])
    (0 min) Besides entity-centric knowledge, usually organized as Knowledge Graph (KG), events are also an essential kind of knowledge in the world, which trigger the spring up of event-centric knowledge representation form like Event KG (EKG). It plays an increasingly important role in many machine learning and artificial intelligence applications, such as intelligent search, question-answering, recommendation, and text generation. This paper provides a comprehensive survey of EKG from history, ontology, instance, and application views. Specifically, to characterize EKG thoroughly, we focus on its history, definitions, schema induction, acquisition, related representative graphs/systems, and applications. The development processes and trends are studied therein. We further summarize perspective directions to facilitate future research on EKG.
    K-Core Decomposition on Super Large Graphs with Limited Resources. (arXiv:2112.14840v1 [cs.DC])
    (0 min) K-core decomposition is a commonly used metric to analyze graph structure or study the relative importance of nodes in complex graphs. Recent years have seen rapid growth in the scale of the graph, especially in industrial settings. For example, our industrial partner runs popular social applications with billions of users and is able to gather a rich set of user data. As a result, applying K-core decomposition on large graphs has attracted more and more attention from academics and the industry. A simple but effective method to deal with large graphs is to train them in the distributed settings, and some distributed K-core decomposition algorithms are also proposed. Despite their effectiveness, we experimentally and theoretically observe that these algorithms consume too many resources and become unstable on super-large-scale graphs, especially when the given resources are limited. In this paper, we deal with those super-large-scale graphs and propose a divide-and-conquer strategy on top of the distributed K-core decomposition algorithm. We evaluate our approach on three large graphs. The experimental results show that the consumption of resources can be significantly reduced, and the calculation on large-scale graphs becomes more stable than the existing methods. For example, the distributed K-core decomposition algorithm can scale to a large graph with 136 billion edges without losing correctness with our divide-and-conquer technique.
    Persformer: A Transformer Architecture for Topological Machine Learning. (arXiv:2112.15210v1 [cs.LG])
    (0 min) One of the main challenges of Topological Data Analysis (TDA) is to extract features from persistent diagrams directly usable by machine learning algorithms. Indeed, persistence diagrams are intrinsically (multi-)sets of points in R2 and cannot be seen in a straightforward manner as vectors. In this article, we introduce Persformer, the first Transformer neural network architecture that accepts persistence diagrams as input. The Persformer architecture significantly outperforms previous topological neural network architectures on classical synthetic benchmark datasets. Moreover, it satisfies a universal approximation theorem. This allows us to introduce the first interpretability method for topological machine learning, which we explore in two examples.
    Audio-to-symbolic Arrangement via Cross-modal Music Representation Learning. (arXiv:2112.15110v1 [cs.SD])
    (0 min) Could we automatically derive the score of a piano accompaniment based on the audio of a pop song? This is the audio-to-symbolic arrangement problem we tackle in this paper. A good arrangement model should not only consider the audio content but also have prior knowledge of piano composition (so that the generation "sounds like" the audio and meanwhile maintains musicality.) To this end, we contribute a cross-modal representation-learning model, which 1) extracts chord and melodic information from the audio, and 2) learns texture representation from both audio and a corrupted ground truth arrangement. We further introduce a tailored training strategy that gradually shifts the source of texture information from corrupted score to audio. In the end, the score-based texture posterior is reduced to a standard normal distribution, and only audio is needed for inference. Experiments show that our model captures major audio information and outperforms baselines in generation quality.
    Mythological Medical Machine Learning: Boosting the Performance of a Deep Learning Medical Data Classifier Using Realistic Physiological Models. (arXiv:2112.15442v1 [cs.LG])
    (0 min) Objective: To determine if a realistic, but computationally efficient model of the electrocardiogram can be used to pre-train a deep neural network (DNN) with a wide range of morphologies and abnormalities specific to a given condition - T-wave Alternans (TWA) as a result of Post-Traumatic Stress Disorder, or PTSD - and significantly boost performance on a small database of rare individuals. Approach: Using a previously validated artificial ECG model, we generated 180,000 artificial ECGs with or without significant TWA, with varying heart rate, breathing rate, TWA amplitude, and ECG morphology. A DNN, trained on over 70,000 patients to classify 25 different rhythms, was modified the output layer to a binary class (TWA or no-TWA, or equivalently, PTSD or no-PTSD), and transfer learning was performed on the artificial ECG. In a final transfer learning step, the DNN was trained and cross-validated on ECG from 12 PTSD and 24 controls for all combinations of using the three databases. Main results: The best performing approach (AUROC = 0.77, Accuracy = 0.72, F1-score = 0.64) was found by performing both transfer learning steps, using the pre-trained arrhythmia DNN, the artificial data and the real PTSD-related ECG data. Removing the artificial data from training led to the largest drop in performance. Removing the arrhythmia data from training provided a modest, but significant, drop in performance. The final model showed no significant drop in performance on the artificial data, indicating no overfitting. Significance: In healthcare, it is common to only have a small collection of high-quality data and labels, or a larger database with much lower quality (and less relevant) labels. The paradigm presented here, involving model-based performance boosting, provides a solution through transfer learning on a large realistic artificial database, and a partially relevant real database.
    GANISP: a GAN-assisted Importance SPlitting Probability Estimator. (arXiv:2112.15444v1 [cs.LG])
    (0 min) Designing manufacturing processes with high yield and strong reliability relies on effective methods for rare event estimation. Genealogical importance splitting reduces the variance of rare event probability estimators by iteratively selecting and replicating realizations that are headed towards a rare event. The replication step is difficult when applied to deterministic systems where the initial conditions of the offspring realizations need to be modified. Typically, a random perturbation is applied to the offspring to differentiate their trajectory from the parent realization. However, this random perturbation strategy may be effective for some systems while failing for others, preventing variance reduction in the probability estimate. This work seeks to address this limitation using a generative model such as a Generative Adversarial Network (GAN) to generate perturbations that are consistent with the attractor of the dynamical system. The proposed GAN-assisted Importance SPlitting method (GANISP) improves the variance reduction for the system targeted. An implementation of the method is available in a companion repository (https://github.com/NREL/GANISP).
    Data-Free Knowledge Transfer: A Survey. (arXiv:2112.15278v1 [cs.LG])
    (0 min) In the last decade, many deep learning models have been well trained and made a great success in various fields of machine intelligence, especially for computer vision and natural language processing. To better leverage the potential of these well-trained models in intra-domain or cross-domain transfer learning situations, knowledge distillation (KD) and domain adaptation (DA) are proposed and become research highlights. They both aim to transfer useful information from a well-trained model with original training data. However, the original data is not always available in many cases due to privacy, copyright or confidentiality. Recently, the data-free knowledge transfer paradigm has attracted appealing attention as it deals with distilling valuable knowledge from well-trained models without requiring to access to the training data. In particular, it mainly consists of the data-free knowledge distillation (DFKD) and source data-free domain adaptation (SFDA). On the one hand, DFKD aims to transfer the intra-domain knowledge of original data from a cumbersome teacher network to a compact student network for model compression and efficient inference. On the other hand, the goal of SFDA is to reuse the cross-domain knowledge stored in a well-trained source model and adapt it to a target domain. In this paper, we provide a comprehensive survey on data-free knowledge transfer from the perspectives of knowledge distillation and unsupervised domain adaptation, to help readers have a better understanding of the current research status and ideas. Applications and challenges of the two areas are briefly reviewed, respectively. Furthermore, we provide some insights to the subject of future research.
    A Graph Attention Learning Approach to Antenna Tilt Optimization. (arXiv:2112.14843v1 [cs.LG])
    (0 min) 6G will move mobile networks towards increasing levels of complexity. To deal with this complexity, optimization of network parameters is key to ensure high performance and timely adaptivity to dynamic network environments. The optimization of the antenna tilt provides a practical and cost-efficient method to improve coverage and capacity in the network. Previous methods based on Reinforcement Learning (RL) have shown great promise for tilt optimization by learning adaptive policies outperforming traditional tilt optimization methods. However, most existing RL methods are based on single-cell features representation, which fails to fully characterize the agent state, resulting in suboptimal performance. Also, most of such methods lack scalability, due to state-action explosion, and generalization ability. In this paper, we propose a Graph Attention Q-learning (GAQ) algorithm for tilt optimization. GAQ relies on a graph attention mechanism to select relevant neighbors information, improve the agent state representation, and update the tilt control policy based on a history of observations using a Deep Q-Network (DQN). We show that GAQ efficiently captures important network information and outperforms standard DQN with local information by a large margin. In addition, we demonstrate its ability to generalize to network deployments of different sizes and densities.
    A theory of independent mechanisms for extrapolation in generative models. (arXiv:2004.00184v2 [cs.LG] UPDATED)
    (0 min) Generative models can be trained to emulate complex empirical data, but are they useful to make predictions in the context of previously unobserved environments? An intuitive idea to promote such extrapolation capabilities is to have the architecture of such model reflect a causal graph of the true data generating process, such that one can intervene on each node independently of the others. However, the nodes of this graph are usually unobserved, leading to overparameterization and lack of identifiability of the causal structure. We develop a theoretical framework to address this challenging situation by defining a weaker form of identifiability, based on the principle of independence of mechanisms. We demonstrate on toy examples that classical stochastic gradient descent can hinder the model's extrapolation capabilities, suggesting independence of mechanisms should be enforced explicitly during training. Experiments on deep generative models trained on real world data support these insights and illustrate how the extrapolation capabilities of such models can be leveraged.
    Bayesian Optimization of Function Networks. (arXiv:2112.15311v1 [cs.LG])
    (2 min) We consider Bayesian optimization of the output of a network of functions, where each function takes as input the output of its parent nodes, and where the network takes significant time to evaluate. Such problems arise, for example, in reinforcement learning, engineering design, and manufacturing. While the standard Bayesian optimization approach observes only the final output, our approach delivers greater query efficiency by leveraging information that the former ignores: intermediate output within the network. This is achieved by modeling the nodes of the network using Gaussian processes and choosing the points to evaluate using, as our acquisition function, the expected improvement computed with respect to the implied posterior on the objective. Although the non-Gaussian nature of this posterior prevents computing our acquisition function in closed form, we show that it can be efficiently maximized via sample average approximation. In addition, we prove that our method is asymptotically consistent, meaning that it finds a globally optimal solution as the number of evaluations grows to infinity, thus generalizing previously known convergence results for the expected improvement. Notably, this holds even though our method might not evaluate the domain densely, instead leveraging problem structure to leave regions unexplored. Finally, we show that our approach dramatically outperforms standard Bayesian optimization methods in several synthetic and real-world problems.
    Settling the Bias and Variance of Meta-Gradient Estimation for Meta-Reinforcement Learning. (arXiv:2112.15400v1 [cs.LG])
    (0 min) In recent years, gradient based Meta-RL (GMRL) methods have achieved remarkable successes in either discovering effective online hyperparameter for one single task (Xu et al., 2018) or learning good initialisation for multi-task transfer learning (Finn et al., 2017). Despite the empirical successes, it is often neglected that computing meta gradients via vanilla backpropagation is ill-defined. In this paper, we argue that the stochastic meta-gradient estimation adopted by many existing MGRL methods are in fact biased; the bias comes from two sources: 1) the compositional bias that is inborn in the structure of compositional optimisation problems and 2) the bias of multi-step Hessian estimation caused by direct automatic differentiation. To better understand the meta gradient biases, we perform the first of its kind study to quantify the amount for each of them. We start by providing a unifying derivation for existing GMRL algorithms, and then theoretically analyse both the bias and the variance of existing gradient estimation methods. On understanding the underlying principles of bias, we propose two mitigation solutions based on off-policy correction and multi-step Hessian estimation techniques. Comprehensive ablation studies have been conducted and results reveals: (1) The existence of these two biases and how they influence the meta-gradient estimation when combined with different estimator/sample size/step and learning rate. (2) The effectiveness of these mitigation approaches for meta-gradient estimation and thereby the final return on two practical Meta-RL algorithms: LOLA-DiCE and Meta-gradient Reinforcement Learning.
    On the Role of Neural Collapse in Transfer Learning. (arXiv:2112.15121v1 [cs.LG])
    (2 min) We study the ability of foundation models to learn representations for classification that are transferable to new, unseen classes. Recent results in the literature show that representations learned by a single classifier over many classes are competitive on few-shot learning problems with representations learned by special-purpose algorithms designed for such problems. In this paper we provide an explanation for this behavior based on the recently observed phenomenon that the features learned by overparameterized classification networks show an interesting clustering property, called neural collapse. We demonstrate both theoretically and empirically that neural collapse generalizes to new samples from the training classes, and -- more importantly -- to new classes as well, allowing foundation models to provide feature maps that work well in transfer learning and, specifically, in the few-shot setting.
    Deep Learning Models for Knowledge Tracing: Review and Empirical Evaluation. (arXiv:2112.15072v1 [cs.LG])
    (2 min) In this work, we review and evaluate a body of deep learning knowledge tracing (DLKT) models with openly available and widely-used data sets, and with a novel data set of students learning to program. The evaluated DLKT models have been reimplemented for assessing reproducibility and replicability of previously reported results. We test different input and output layer variations found in the compared models that are independent of the main architectures of the models, and different maximum attempt count options that have been implicitly and explicitly used in some studies. Several metrics are used to reflect on the quality of the evaluated knowledge tracing models. The evaluated knowledge tracing models include Vanilla-DKT, two Long Short-Term Memory Deep Knowledge Tracing (LSTM-DKT) variants, two Dynamic Key-Value Memory Network (DKVMN) variants, and Self-Attentive Knowledge Tracing (SAKT). We evaluate logistic regression, Bayesian Knowledge Tracing (BKT) and simple non-learning models as baselines. Our results suggest that the DLKT models in general outperform non-DLKT models, and the relative differences between the DLKT models are subtle and often vary between datasets. Our results also show that naive models such as mean prediction can yield better performance than more sophisticated knowledge tracing models, especially in terms of accuracy. Further, our metric and hyperparameter analysis shows that the metric used to select the best model hyperparameters has a noticeable effect on the performance of the models, and that metric choice can affect model ranking. We also study the impact of input and output layer variations, filtering out long attempt sequences, and non-model properties such as randomness and hardware. Finally, we discuss model performance replicability and related issues. Our model implementations, evaluation code, and data are published as a part of this work.
    Reversible Upper Confidence Bound Algorithm to Generate Diverse Optimized Candidates. (arXiv:2112.14893v1 [cs.LG])
    (2 min) Most algorithms for the multi-armed bandit problem in reinforcement learning aimed to maximize the expected reward, which are thus useful in searching the optimized candidate with the highest reward (function value) for diverse applications (e.g., AlphaGo). However, in some typical application scenaios such as drug discovery, the aim is to search a diverse set of candidates with high reward. Here we propose a reversible upper confidence bound (rUCB) algorithm for such a purpose, and demonstrate its application in virtual screening upon intrinsically disordered proteins (IDPs). It is shown that rUCB greatly reduces the query times while achieving both high accuracy and low performance loss.The rUCB may have potential application in multipoint optimization and other reinforcement-learning cases.
    Learned Coarse Models for Efficient Turbulence Simulation. (arXiv:2112.15275v1 [physics.flu-dyn])
    (0 min) Turbulence simulation with classical numerical solvers requires very high-resolution grids to accurately resolve dynamics. Here we train learned simulators at low spatial and temporal resolutions to capture turbulent dynamics generated at high resolution. We show that our proposed model can simulate turbulent dynamics more accurately than classical numerical solvers at the same low resolutions across various scientifically relevant metrics. Our model is trained end-to-end from data and is capable of learning a range of challenging chaotic and turbulent dynamics at low resolution, including trajectories generated by the state-of-the-art Athena++ engine. We show that our simpler, general-purpose architecture outperforms various more specialized, turbulence-specific architectures from the learned turbulence simulation literature. In general, we see that learned simulators yield unstable trajectories; however, we show that tuning training noise and temporal downsampling solves this problem. We also find that while generalization beyond the training distribution is a challenge for learned models, training noise, convolutional architectures, and added loss constraints can help. Broadly, we conclude that our learned simulator outperforms traditional solvers run on coarser grids, and emphasize that simple design choices can offer stability and robust generalization.
    Deep Graph Clustering via Dual Correlation Reduction. (arXiv:2112.14772v1 [cs.LG])
    (0 min) Deep graph clustering, which aims to reveal the underlying graph structure and divide the nodes into different groups, has attracted intensive attention in recent years. However, we observe that, in the process of node encoding, existing methods suffer from representation collapse which tends to map all data into the same representation. Consequently, the discriminative capability of the node representation is limited, leading to unsatisfied clustering performance. To address this issue, we propose a novel self-supervised deep graph clustering method termed Dual Correlation Reduction Network (DCRN) by reducing information correlation in a dual manner. Specifically, in our method, we first design a siamese network to encode samples. Then by forcing the cross-view sample correlation matrix and cross-view feature correlation matrix to approximate two identity matrices, respectively, we reduce the information correlation in the dual-level, thus improving the discriminative capability of the resulting features. Moreover, in order to alleviate representation collapse caused by over-smoothing in GCN, we introduce a propagation regularization term to enable the network to gain long-distance information with the shallow network structure. Extensive experimental results on six benchmark datasets demonstrate the effectiveness of the proposed DCRN against the existing state-of-the-art methods.
    Active Learning-Based Optimization of Scientific Experimental Design. (arXiv:2112.14811v1 [cs.LG])
    (0 min) Active learning (AL) is a machine learning algorithm that can achieve greater accuracy with fewer labeled training instances, for having the ability to ask oracles to label the most valuable unlabeled data chosen iteratively and heuristically by query strategies. Scientific experiments nowadays, though becoming increasingly automated, are still suffering from human involvement in the designing process and the exhaustive search in the experimental space. This article performs a retrospective study on a drug response dataset using the proposed AL scheme comprised of the matrix factorization method of alternating least square (ALS) and deep neural networks (DNN). This article also proposes an AL query strategy based on expected loss minimization. As a result, the retrospective study demonstrates that scientific experimental design, instead of being manually set, can be optimized by AL, and the proposed query strategy ELM sampling shows better experimental performance than other ones such as random sampling and uncertainty sampling.
    Digital Rock Typing DRT Algorithm Formulation with Optimal Supervised Semantic Segmentation. (arXiv:2112.15068v1 [cs.LG])
    (0 min) Each grid block in a 3D geological model requires a rock type that represents all physical and chemical properties of that block. The properties that classify rock types are lithology, permeability, and capillary pressure. Scientists and engineers determined these properties using conventional laboratory measurements, which embedded destructive methods to the sample or altered some of its properties (i.e., wettability, permeability, and porosity) because the measurements process includes sample crushing, fluid flow, or fluid saturation. Lately, Digital Rock Physics (DRT) has emerged to quantify these properties from micro-Computerized Tomography (uCT) and Magnetic Resonance Imaging (MRI) images. However, the literature did not attempt rock typing in a wholly digital context. We propose performing Digital Rock Typing (DRT) by: (1) integrating the latest DRP advances in a novel process that honors digital rock properties determination, while; (2) digitalizing the latest rock typing approaches in carbonate, and (3) introducing a novel carbonate rock typing process that utilizes computer vision capabilities to provide more insight about the heterogeneous carbonate rock texture.
    Retrieving Black-box Optimal Images from External Databases. (arXiv:2112.14921v1 [cs.IR])
    (2 min) Suppose we have a black-box function (e.g., deep neural network) that takes an image as input and outputs a value that indicates preference. How can we retrieve optimal images with respect to this function from an external database on the Internet? Standard retrieval problems in the literature (e.g., item recommendations) assume that an algorithm has full access to the set of items. In other words, such algorithms are designed for service providers. In this paper, we consider the retrieval problem under different assumptions. Specifically, we consider how users with limited access to an image database can retrieve images using their own black-box functions. This formulation enables a flexible and finer-grained image search defined by each user. We assume the user can access the database through a search query with tight API limits. Therefore, a user needs to efficiently retrieve optimal images in terms of the number of queries. We propose an efficient retrieval algorithm Tiara for this problem. In the experiments, we confirm that our proposed method performs better than several baselines under various settings.
    Local Quadratic Convergence of Stochastic Gradient Descent with Adaptive Step Size. (arXiv:2112.14872v1 [math.OC])
    (0 min) Establishing a fast rate of convergence for optimization methods is crucial to their applicability in practice. With the increasing popularity of deep learning over the past decade, stochastic gradient descent and its adaptive variants (e.g. Adagrad, Adam, etc.) have become prominent methods of choice for machine learning practitioners. While a large number of works have demonstrated that these first order optimization methods can achieve sub-linear or linear convergence, we establish local quadratic convergence for stochastic gradient descent with adaptive step size for problems such as matrix inversion.
    RheFrameDetect: A Text Classification System for Automatic Detection of Rhetorical Frames in AI from Open Sources. (arXiv:2112.14933v1 [cs.CL])
    (0 min) Rhetorical Frames in AI can be thought of as expressions that describe AI development as a competition between two or more actors, such as governments or companies. Examples of such Frames include robotic arms race, AI rivalry, technological supremacy, cyberwarfare dominance and 5G race. Detection of Rhetorical Frames from open sources can help us track the attitudes of governments or companies towards AI, specifically whether attitudes are becoming more cooperative or competitive over time. Given the rapidly increasing volumes of open sources (online news media, twitter, blogs), it is difficult for subject matter experts to identify Rhetorical Frames in (near) real-time. Moreover, these sources are in general unstructured (noisy) and therefore, detecting Frames from these sources will require state-of-the-art text classification techniques. In this paper, we develop RheFrameDetect, a text classification system for (near) real-time capture of Rhetorical Frames from open sources. Given an input document, RheFrameDetect employs text classification techniques at multiple levels (document level and paragraph level) to identify all occurrences of Frames used in the discussion of AI. We performed extensive evaluation of the text classification techniques used in RheFrameDetect against human annotated Frames from multiple news sources. To further demonstrate the effectiveness of RheFrameDetect, we show multiple case studies depicting the Frames identified by RheFrameDetect compared against human annotated Frames.
    Training Recurrent Neural Networks by Sequential Least Squares and the Alternating Direction Method of Multipliers. (arXiv:2112.15348v1 [cs.LG])
    (0 min) For training recurrent neural network models of nonlinear dynamical systems from an input/output training dataset based on rather arbitrary convex and twice-differentiable loss functions and regularization terms, we propose the use of sequential least squares for determining the optimal network parameters and hidden states. In addition, to handle non-smooth regularization terms such as L1, L0, and group-Lasso regularizers, as well as to impose possibly non-convex constraints such as integer and mixed-integer constraints, we combine sequential least squares with the alternating direction method of multipliers (ADMM). The performance of the resulting algorithm, that we call NAILS (Nonconvex ADMM Iterations and Least Squares), is tested in a nonlinear system identification benchmark.
    Development of a face mask detection pipeline for mask-wearing monitoring in the era of the COVID-19 pandemic: A modular approach. (arXiv:2112.15031v1 [cs.CV])
    (2 min) During the SARS-Cov-2 pandemic, mask-wearing became an effective tool to prevent spreading and contracting the virus. The ability to monitor the mask-wearing rate in the population would be useful for determining public health strategies against the virus. However, artificial intelligence technologies for detecting face masks have not been deployed at a large scale in real-life to measure the mask-wearing rate in public. In this paper, we present a two-step face mask detection approach consisting of two separate modules: 1) face detection and alignment and 2) face mask classification. This approach allowed us to experiment with different combinations of face detection and face mask classification modules. More specifically, we experimented with PyramidKey and RetinaFace as face detectors while maintaining a lightweight backbone for the face mask classification module. Moreover, we also provide a relabeled annotation of the test set of the AIZOO dataset, where we rectified the incorrect labels for some face images. The evaluation results on the AIZOO and Moxa 3K datasets showed that the proposed face mask detection pipeline surpassed the state-of-the-art methods. The proposed pipeline also yielded a higher mAP on the relabeled test set of the AIZOO dataset than the original test set. Since we trained the proposed model using in-the-wild face images, we can successfully deploy our model to monitor the mask-wearing rate using public CCTV images.
    Domain Adaptation with Category Attention Network for Deep Sentiment Analysis. (arXiv:2112.15290v1 [cs.CL])
    (0 min) Domain adaptation tasks such as cross-domain sentiment classification aim to utilize existing labeled data in the source domain and unlabeled or few labeled data in the target domain to improve the performance in the target domain via reducing the shift between the data distributions. Existing cross-domain sentiment classification methods need to distinguish pivots, i.e., the domain-shared sentiment words, and non-pivots, i.e., the domain-specific sentiment words, for excellent adaptation performance. In this paper, we first design a Category Attention Network (CAN), and then propose a model named CAN-CNN to integrate CAN and a Convolutional Neural Network (CNN). On the one hand, the model regards pivots and non-pivots as unified category attribute words and can automatically capture them to improve the domain adaptation performance; on the other hand, the model makes an attempt at interpretability to learn the transferred category attribute words. Specifically, the optimization objective of our model has three different components: 1) the supervised classification loss; 2) the distributions loss of category feature weights; 3) the domain invariance loss. Finally, the proposed model is evaluated on three public sentiment analysis datasets and the results demonstrate that CAN-CNN can outperform other various baseline methods.
    Graph Neural Networks for Communication Networks: Context, Use Cases and Opportunities. (arXiv:2112.14792v1 [cs.NI])
    (2 min) Graph neural networks (GNN) have shown outstanding applications in many fields where data is fundamentally represented as graphs (e.g., chemistry, biology, recommendation systems). In this vein, communication networks comprise many fundamental components that are naturally represented in a graph-structured manner (e.g., topology, configurations, traffic flows). This position article presents GNNs as a fundamental tool for modeling, control and management of communication networks. GNNs represent a new generation of data-driven models that can accurately learn and reproduce the complex behaviors behind real networks. As a result, such models can be applied to a wide variety of networking use cases, such as planning, online optimization, or troubleshooting. The main advantage of GNNs over traditional neural networks lies in its unprecedented generalization capabilities when applied to other networks and configurations unseen during training, which is a critical feature for achieving practical data-driven solutions for networking. This article comprises a brief tutorial on GNNs and their possible applications to communication networks. To showcase the potential of this technology, we present two use cases with state-of-the-art GNN models respectively applied to wired and wireless networks. Lastly, we delve into the key open challenges and opportunities yet to be explored in this novel research area.
    Transfer learning for cancer diagnosis in histopathological images. (arXiv:2112.15523v1 [eess.IV])
    (0 min) Transfer learning allows us to exploit knowledge gained from one task to assist in solving another but relevant task. In modern computer vision research, the question is which architecture performs better for a given dataset. In this paper, we compare the performance of 14 pre-trained ImageNet models on the histopathologic cancer detection dataset, where each model has been configured as a naive model, feature extractor model, or fine-tuned model. Densenet161 has been shown to have high precision whilst Resnet101 has a high recall. A high precision model is suitable to be used when follow-up examination cost is high, whilst low precision but a high recall/sensitivity model can be used when the cost of follow-up examination is low. Results also show that transfer learning helps to converge a model faster.
    Explainable Signature-based Machine Learning Approach for Identification of Faults in Grid-Connected Photovoltaic Systems. (arXiv:2112.14842v1 [cs.LG])
    (0 min) The transformation of conventional power networks into smart grids with the heavy penetration level of renewable energy resources, particularly grid-connected Photovoltaic (PV) systems, has increased the need for efficient fault identification systems. Malfunctioning any single component in grid-connected PV systems may lead to grid instability and other serious consequences, showing that a reliable fault identification system is the utmost requirement for ensuring operational integrity. Therefore, this paper presents a novel fault identification approach based on statistical signatures of PV operational states. These signatures are unique because each fault has a different nature and distinctive impact on the electrical system. Thus, the Random Forest Classifier trained on these extracted signatures showed 100% accuracy in identifying all types of faults. Furthermore, the performance comparison of the proposed framework with other Machine Learning classifiers depicts its credibility. Moreover, to elevate user trust in the predicted outcomes, SHAP (Shapley Additive Explanation) was utilized during the training phase to extract a complete model response (global explanation). This extracted global explanation can help in the assessment of predicted outcomes credibility by decoding each prediction in terms of features contribution. Hence, the proposed explainable signature-based fault identification technique is highly credible and fulfills all the requirements of smart grids.
    Robust Entropy-regularized Markov Decision Processes. (arXiv:2112.15364v1 [cs.LG])
    (0 min) Stochastic and soft optimal policies resulting from entropy-regularized Markov decision processes (ER-MDP) are desirable for exploration and imitation learning applications. Motivated by the fact that such policies are sensitive with respect to the state transition probabilities, and the estimation of these probabilities may be inaccurate, we study a robust version of the ER-MDP model, where the stochastic optimal policies are required to be robust with respect to the ambiguity in the underlying transition probabilities. Our work is at the crossroads of two important schemes in reinforcement learning (RL), namely, robust MDP and entropy regularized MDP. We show that essential properties that hold for the non-robust ER-MDP and robust unregularized MDP models also hold in our settings, making the robust ER-MDP problem tractable. We show how our framework and results can be integrated into different algorithmic schemes including value or (modified) policy iteration, which would lead to new robust RL and inverse RL algorithms to handle uncertainties. Analyses on computational complexity and error propagation under conventional uncertainty settings are also provided.
    Control Theoretic Analysis of Temporal Difference Learning. (arXiv:2112.14417v2 [cs.AI] UPDATED)
    (0 min) The goal of this paper is to investigate a control theoretic analysis of linear stochastic iterative algorithm and temporal difference (TD) learning. TD-learning is a linear stochastic iterative algorithm to estimate the value function of a given policy for a Markov decision process, which is one of the most popular and fundamental reinforcement learning algorithms. While there has been a series of successful works in theoretical analysis of TD-learning, it was not until recently that researchers found some guarantees on its statistical efficiency. In this paper, we propose a control theoretic finite-time analysis TD-learning, which exploits standard notions in linear system control communities. Therefore, the proposed work provides additional insights on TD-learning and reinforcement learning with simple concepts and analysis tools in control theory.
    Constructing a Good Behavior Basis for Transfer using Generalized Policy Updates. (arXiv:2112.15025v1 [cs.LG])
    (0 min) We study the problem of learning a good set of policies, so that when combined together, they can solve a wide variety of unseen reinforcement learning tasks with no or very little new data. Specifically, we consider the framework of generalized policy evaluation and improvement, in which the rewards for all tasks of interest are assumed to be expressible as a linear combination of a fixed set of features. We show theoretically that, under certain assumptions, having access to a specific set of diverse policies, which we call a set of independent policies, can allow for instantaneously achieving high-level performance on all possible downstream tasks which are typically more complex than the ones on which the agent was trained. Based on this theoretical analysis, we propose a simple algorithm that iteratively constructs this set of policies. In addition to empirically validating our theoretical results, we compare our approach with recently proposed diverse policy set construction methods and show that, while others fail, our approach is able to build a behavior basis that enables instantaneous transfer to all possible downstream tasks. We also show empirically that having access to a set of independent policies can better bootstrap the learning process on downstream tasks where the new reward function cannot be described as a linear combination of the features. Finally, we demonstrate that this policy set can be useful in a realistic lifelong reinforcement learning setting.
    Optimal Model Averaging of Support Vector Machines in Diverging Model Spaces. (arXiv:2112.12961v2 [stat.ML] UPDATED)
    (0 min) Support vector machine (SVM) is a powerful classification method that has achieved great success in many fields. Since its performance can be seriously impaired by redundant covariates, model selection techniques are widely used for SVM with high dimensional covariates. As an alternative to model selection, significant progress has been made in the area of model averaging in the past decades. Yet no frequentist model averaging method was considered for SVM. This work aims to fill the gap and to propose a frequentist model averaging procedure for SVM which selects the optimal weight by cross validation. Even when the number of covariates diverges at an exponential rate of the sample size, we show asymptotic optimality of the proposed method in the sense that the ratio of its hinge loss to the lowest possible loss converges to one. We also derive the convergence rate which provides more insights to model averaging. Compared to model selection methods of SVM which require a tedious but critical task of tuning parameter selection, the model averaging method avoids the task and shows promising performances in the empirical studies.
    Decentralized Optimization Over the Stiefel Manifold by an Approximate Augmented Lagrangian Function. (arXiv:2112.14949v1 [math.OC])
    (2 min) In this paper, we focus on the decentralized optimization problem over the Stiefel manifold, which is defined on a connected network of $d$ agents. The objective is an average of $d$ local functions, and each function is privately held by an agent and encodes its data. The agents can only communicate with their neighbors in a collaborative effort to solve this problem. In existing methods, multiple rounds of communications are required to guarantee the convergence, giving rise to high communication costs. In contrast, this paper proposes a decentralized algorithm, called DESTINY, which only invokes a single round of communications per iteration. DESTINY combines gradient tracking techniques with a novel approximate augmented Lagrangian function. The global convergence to stationary points is rigorously established. Comprehensive numerical experiments demonstrate that DESTINY has a strong potential to deliver a cutting-edge performance in solving a variety of testing problems.
    Application of Hierarchical Temporal Memory Theory for Document Categorization. (arXiv:2112.14820v1 [cs.CL])
    (2 min) The current work intends to study the performance of the Hierarchical Temporal Memory(HTM) theory for automated classification of text as well as documents. HTM is a biologically inspired theory based on the working principles of the human neocortex. The current study intends to provide an alternative framework for document categorization using the Spatial Pooler learning algorithm in the HTM Theory. As HTM accepts only a stream of binary data as input, Latent Semantic Indexing(LSI) technique is used for extracting the top features from the input and converting them into binary format. The Spatial Pooler algorithm converts the binary input into sparse patterns with similar input text having overlapping spatial patterns making it easy for classifying the patterns into categories. The results obtained prove that HTM theory, although is in its nascent stages, performs at par with most of the popular machine learning based classifiers.
    The SAMME.C2 algorithm for severely imbalanced multi-class classification. (arXiv:2112.14868v1 [stat.ML])
    (2 min) Classification predictive modeling involves the accurate assignment of observations in a dataset to target classes or categories. There is an increasing growth of real-world classification problems with severely imbalanced class distributions. In this case, minority classes have much fewer observations to learn from than those from majority classes. Despite this sparsity, a minority class is often considered the more interesting class yet developing a scientific learning algorithm suitable for the observations presents countless challenges. In this article, we suggest a novel multi-class classification algorithm specialized to handle severely imbalanced classes based on the method we refer to as SAMME.C2. It blends the flexible mechanics of the boosting techniques from SAMME algorithm, a multi-class classifier, and Ada.C2 algorithm, a cost-sensitive binary classifier designed to address highly class imbalances. Not only do we provide the resulting algorithm but we also establish scientific and statistical formulation of our proposed SAMME.C2 algorithm. Through numerical experiments examining various degrees of classifier difficulty, we demonstrate consistent superior performance of our proposed model.
    A general technique for the estimation of farm animal body part weights from CT scans and its applications in a rabbit breeding program. (arXiv:2112.15095v1 [cs.CV])
    (2 min) Various applications of farm animal imaging are based on the estimation of weights of certain body parts and cuts from the CT images of animals. In many cases, the complexity of the problem is increased by the enormous variability of postures in CT images due to the scanning of non-sedated, living animals. In this paper, we propose a general and robust approach for the estimation of the weights of cuts and body parts from the CT images of (possibly) living animals. We adapt multi-atlas based segmentation driven by elastic registration and joint feature and model selection for the regression component to cape with the large number of features and low number of samples. The proposed technique is evaluated and illustrated through real applications in rabbit breeding programs, showing r^2 scores 12% higher than previous techniques and methods that used to drive the selection so far. The proposed technique is easily adaptable to similar problems, consequently, it is shared in an open source software package for the benefit of the community.
    Dimensionality reduction for prediction: Application to Bitcoin and Ethereum. (arXiv:2112.15036v1 [q-fin.ST])
    (2 min) The objective of this paper is to assess the performances of dimensionality reduction techniques to establish a link between cryptocurrencies. We have focused our analysis on the two most traded cryptocurrencies: Bitcoin and Ethereum. To perform our analysis, we took log returns and added some covariates to build our data set. We first introduced the pearson correlation coefficient in order to have a preliminary assessment of the link between Bitcoin and Ethereum. We then reduced the dimension of our data set using canonical correlation analysis and principal component analysis. After performing an analysis of the links between Bitcoin and Ethereum with both statistical techniques, we measured their performance on forecasting Ethereum returns with Bitcoin s features.
    Aim in Climate Change and City Pollution. (arXiv:2112.15115v1 [cs.LG])
    (2 min) The sustainability of urban environments is an increasingly relevant problem. Air pollution plays a key role in the degradation of the environment as well as the health of the citizens exposed to it. In this chapter we provide a review of the methods available to model air pollution, focusing on the application of machine-learning methods. In fact, machine-learning methods have proved to importantly increase the accuracy of traditional air-pollution approaches while limiting the development cost of the models. Machine-learning tools have opened new approaches to study air pollution, such as flow-dynamics modelling or remote-sensing methodologies.
    Are we really making much progress? Revisiting, benchmarking, and refining heterogeneous graph neural networks. (arXiv:2112.14936v1 [cs.LG])
    (2 min) Heterogeneous graph neural networks (HGNNs) have been blossoming in recent years, but the unique data processing and evaluation setups used by each work obstruct a full understanding of their advancements. In this work, we present a systematical reproduction of 12 recent HGNNs by using their official codes, datasets, settings, and hyperparameters, revealing surprising findings about the progress of HGNNs. We find that the simple homogeneous GNNs, e.g., GCN and GAT, are largely underestimated due to improper settings. GAT with proper inputs can generally match or outperform all existing HGNNs across various scenarios. To facilitate robust and reproducible HGNN research, we construct the Heterogeneous Graph Benchmark (HGB), consisting of 11 diverse datasets with three tasks. HGB standardizes the process of heterogeneous graph data splits, feature processing, and performance evaluation. Finally, we introduce a simple but very strong baseline Simple-HGN--which significantly outperforms all previous models on HGB--to accelerate the advancement of HGNNs in the future.
    Two Instances of Interpretable Neural Network for Universal Approximations. (arXiv:2112.15026v1 [cs.LG])
    (2 min) This paper proposes two bottom-up interpretable neural network (NN) constructions for universal approximation, namely Triangularly-constructed NN (TNN) and Semi-Quantized Activation NN (SQANN). The notable properties are (1) resistance to catastrophic forgetting (2) existence of proof for arbitrarily high accuracies on training dataset (3) for an input \(x\), users can identify specific samples of training data whose activation ``fingerprints" are similar to that of \(x\)'s activations. Users can also identify samples that are out of distribution.
    Binary Diffing as a Network Alignment Problem via Belief Propagation. (arXiv:2112.15337v1 [cs.LG])
    (2 min) In this paper, we address the problem of finding a correspondence, or matching, between the functions of two programs in binary form, which is one of the most common task in binary diffing. We introduce a new formulation of this problem as a particular instance of a graph edit problem over the call graphs of the programs. In this formulation, the quality of a mapping is evaluated simultaneously with respect to both function content and call graph similarities. We show that this formulation is equivalent to a network alignment problem. We propose a solving strategy for this problem based on max-product belief propagation. Finally, we implement a prototype of our method, called QBinDiff, and propose an extensive evaluation which shows that our approach outperforms state of the art diffing tools.
    Training Quantized Deep Neural Networks via Cooperative Coevolution. (arXiv:2112.14834v1 [cs.NE])
    (2 min) Quantizing deep neural networks (DNNs) has been a promising solution for deploying deep neural networks on embedded devices. However, most of the existing methods do not quantize gradients, and the process of quantizing DNNs still has a lot of floating-point operations, which hinders the further applications of quantized DNNs. To solve this problem, we propose a new heuristic method based on cooperative coevolution for quantizing DNNs. Under the framework of cooperative coevolution, we use the estimation of distribution algorithm to search for the low-bits weights. Specifically, we first construct an initial quantized network from a pre-trained network instead of random initialization and then start searching from it by restricting the search space. So far, the problem is the largest discrete problem known to be solved by evolutionary algorithms. Experiments show that our method can train 4 bit ResNet-20 on the Cifar-10 dataset without sacrificing accuracy.
    InverseMV: Composing Piano Scores with a Convolutional Video-Music Transformer. (arXiv:2112.15320v1 [cs.LG])
    (2 min) Many social media users prefer consuming content in the form of videos rather than text. However, in order for content creators to produce videos with a high click-through rate, much editing is needed to match the footage to the music. This posts additional challenges for more amateur video makers. Therefore, we propose a novel attention-based model VMT (Video-Music Transformer) that automatically generates piano scores from video frames. Using music generated from models also prevent potential copyright infringements that often come with using existing music. To the best of our knowledge, there is no work besides the proposed VMT that aims to compose music for video. Additionally, there lacks a dataset with aligned video and symbolic music. We release a new dataset composed of over 7 hours of piano scores with fine alignment between pop music videos and MIDI files. We conduct experiments with human evaluation on VMT, SeqSeq model (our baseline), and the original piano version soundtrack. VMT achieves consistent improvements over the baseline on music smoothness and video relevance. In particular, with the relevance scores and our case study, our model has shown the capability of multimodality on frame-level actors' movement for music generation. Our VMT model, along with the new dataset, presents a promising research direction toward composing the matching soundtrack for videos. We have released our code at https://github.com/linchintung/VMT
    SimSR: Simple Distance-based State Representation for Deep Reinforcement Learning. (arXiv:2112.15303v1 [cs.LG])
    (2 min) This work explores how to learn robust and generalizable state representation from image-based observations with deep reinforcement learning methods. Addressing the computational complexity, stringent assumptions, and representation collapse challenges in the existing work of bisimulation metric, we devise Simple State Representation (SimSR) operator, which achieves equivalent functionality while reducing the complexity by an order in comparison with bisimulation metric. SimSR enables us to design a stochastic-approximation-based method that can practically learn the mapping functions (encoders) from observations to latent representation space. Besides the theoretical analysis, we experimented and compared our work with recent state-of-the-art solutions in visual MuJoCo tasks. The results show that our model generally achieves better performance and has better robustness and good generalization.
    Learning Inception Attention for Image Synthesis and Image Recognition. (arXiv:2112.14804v1 [cs.CV])
    (2 min) Image synthesis and image recognition have witnessed remarkable progress, but often at the expense of computationally expensive training and inference. Learning lightweight yet expressive deep model has emerged as an important and interesting direction. Inspired by the well-known split-transform-aggregate design heuristic in the Inception building block, this paper proposes a Skip-Layer Inception Module (SLIM) that facilitates efficient learning of image synthesis models, and a same-layer variant (dubbed as SLIM too) as a stronger alternative to the well-known ResNeXts for image recognition. In SLIM, the input feature map is first split into a number of groups (e.g., 4).Each group is then transformed to a latent style vector(via channel-wise attention) and a latent spatial mask (via spatial attention). The learned latent masks and latent style vectors are aggregated to modulate the target feature map. For generative learning, SLIM is built on a recently proposed lightweight Generative Adversarial Networks (i.e., FastGANs) which present a skip-layer excitation(SLE) module. For few-shot image synthesis tasks, the proposed SLIM achieves better performance than the SLE work and other related methods. For one-shot image synthesis tasks, it shows stronger capability of preserving images structures than prior arts such as the SinGANs. For image classification tasks, the proposed SLIM is used as a drop-in replacement for convolution layers in ResNets (resulting in ResNeXt-like models) and achieves better accuracy in theImageNet-1000 dataset, with significantly smaller model complexity
    Resource-Efficient Deep Learning: A Survey on Model-, Arithmetic-, and Implementation-Level Techniques. (arXiv:2112.15131v1 [cs.LG])
    (2 min) Deep learning is pervasive in our daily life, including self-driving cars, virtual assistants, social network services, healthcare services, face recognition, etc. However, deep neural networks demand substantial compute resources during training and inference. The machine learning community has mainly focused on model-level optimizations such as architectural compression of deep learning models, while the system community has focused on implementation-level optimization. In between, various arithmetic-level optimization techniques have been proposed in the arithmetic community. This article provides a survey on resource-efficient deep learning techniques in terms of model-, arithmetic-, and implementation-level techniques and identifies the research gaps for resource-efficient deep learning techniques across the three different level techniques. Our survey clarifies the influence from higher to lower-level techniques based on our resource-efficiency metric definition and discusses the future trend for resource-efficient deep learning research.
    On Distinctive Properties of Universal Perturbations. (arXiv:2112.15329v1 [cs.LG])
    (2 min) We identify properties of universal adversarial perturbations (UAPs) that distinguish them from standard adversarial perturbations. Specifically, we show that targeted UAPs generated by projected gradient descent exhibit two human-aligned properties: semantic locality and spatial invariance, which standard targeted adversarial perturbations lack. We also demonstrate that UAPs contain significantly less signal for generalization than standard adversarial perturbations -- that is, UAPs leverage non-robust features to a smaller extent than standard adversarial perturbations.
    Pose Estimation of Specific Rigid Objects. (arXiv:2112.15075v1 [cs.CV])
    (3 min) In this thesis, we address the problem of estimating the 6D pose of rigid objects from a single RGB or RGB-D input image, assuming that 3D models of the objects are available. This problem is of great importance to many application fields such as robotic manipulation, augmented reality, and autonomous driving. First, we propose EPOS, a method for 6D object pose estimation from an RGB image. The key idea is to represent an object by compact surface fragments and predict the probability distribution of corresponding fragments at each pixel of the input image by a neural network. Each pixel is linked with a data-dependent number of fragments, which allows systematic handling of symmetries, and the 6D poses are estimated from the links by a RANSAC-based fitting method. EPOS outperformed all RGB and most RGB-D and D methods on several standard datasets. Second, we present HashMatch, an RGB-D method that slides a window over the input image and searches for a match against templates, which are pre-generated by rendering 3D object models in different orientations. The method applies a cascade of evaluation stages to each window location, which avoids exhaustive matching against all templates. Third, we propose ObjectSynth, an approach to synthesize photorealistic images of 3D object models for training methods based on neural networks. The images yield substantial improvements compared to commonly used images of objects rendered on top of random photographs. Fourth, we introduce T-LESS, the first dataset for 6D object pose estimation that includes 3D models and RGB-D images of industry-relevant objects. Fifth, we define BOP, a benchmark that captures the status quo in the field. BOP comprises eleven datasets in a unified format, an evaluation methodology, an online evaluation system, and public challenges held at international workshops organized at the ICCV and ECCV conferences.
    A Survey of Deep Learning Techniques for Dynamic Branch Prediction. (arXiv:2112.14911v1 [cs.AR])
    (2 min) Branch prediction is an architectural feature that speeds up the execution of branch instruction on pipeline processors and reduces the cost of branching. Recent advancements of Deep Learning (DL) in the post Moore's Law era is accelerating areas of automated chip design, low-power computer architectures, and much more. Traditional computer architecture design and algorithms could benefit from dynamic predictors based on deep learning algorithms which learns from experience by optimizing its parameters on large number of data. In this survey paper, we focus on traditional branch prediction algorithms, analyzes its limitations, and presents a literature survey of how deep learning techniques can be applied to create dynamic branch predictors capable of predicting conditional branch instructions. Prior surveys in this field focus on dynamic branch prediction techniques based on neural network perceptrons. We plan to improve the survey based on latest research in DL and advanced Machine Learning (ML) based branch predictors.
    Bayesian Algorithms Learn to Stabilize Unknown Continuous-Time Systems. (arXiv:2112.15094v1 [eess.SY])
    (2 min) Linear dynamical systems are canonical models for learning-based control of plants with uncertain dynamics. The setting consists of a stochastic differential equation that captures the state evolution of the plant understudy, while the true dynamics matrices are unknown and need to be learned from the observed data of state trajectory. An important issue is to ensure that the system is stabilized and destabilizing control actions due to model uncertainties are precluded as soon as possible. A reliable stabilization procedure for this purpose that can effectively learn from unstable data to stabilize the system in a finite time is not currently available. In this work, we propose a novel Bayesian learning algorithm that stabilizes unknown continuous-time stochastic linear systems. The presented algorithm is flexible and exposes effective stabilization performance after a remarkably short time period of interacting with the system.
  • stat.ML updates on arXiv.org

    Uniform-PAC Bounds for Reinforcement Learning with Linear Function Approximation. (arXiv:2106.11612v2 [cs.LG] UPDATED)
    (2 min) We study reinforcement learning (RL) with linear function approximation. Existing algorithms for this problem only have high-probability regret and/or Probably Approximately Correct (PAC) sample complexity guarantees, which cannot guarantee the convergence to the optimal policy. In this paper, in order to overcome the limitation of existing algorithms, we propose a new algorithm called FLUTE, which enjoys uniform-PAC convergence to the optimal policy with high probability. The uniform-PAC guarantee is the strongest possible guarantee for reinforcement learning in the literature, which can directly imply both PAC and high probability regret bounds, making our algorithm superior to all existing algorithms with linear function approximation. At the core of our algorithm is a novel minimax value function estimator and a multi-level partition scheme to select the training samples from historical observations. Both of these techniques are new and of independent interest.
    Bounding Wasserstein distance with couplings. (arXiv:2112.03152v2 [stat.CO] CROSS LISTED)
    (2 min) Markov chain Monte Carlo (MCMC) provides asymptotically consistent estimates of intractable posterior expectations as the number of iterations tends to infinity. However, in large data applications, MCMC can be computationally expensive per iteration. This has catalyzed interest in sampling methods such as approximate MCMC, which trade off asymptotic consistency for improved computational speed. In this article, we propose estimators based on couplings of Markov chains to assess the quality of such asymptotically biased sampling methods. The estimators give empirical upper bounds of the Wassertein distance between the limiting distribution of the asymptotically biased sampling method and the original target distribution of interest. We establish theoretical guarantees for our upper bounds and show that our estimators can remain effective in high dimensions. We apply our quality measures to stochastic gradient MCMC, variational Bayes, and Laplace approximations for tall data and to approximate MCMC for Bayesian logistic regression in 4500 dimensions and Bayesian linear regression in 50000 dimensions.
    Sparse Generalized Yule-Walker Estimation for Large Spatio-temporal Autoregressions with an Application to NO2 Satellite Data. (arXiv:2108.02864v2 [econ.EM] UPDATED)
    (2 min) We consider a high-dimensional model in which variables are observed over time and space. The model consists of a spatio-temporal regression containing a time lag and a spatial lag of the dependent variable. Unlike classical spatial autoregressive models, we do not rely on a predetermined spatial interaction matrix, but infer all spatial interactions from the data. Assuming sparsity, we estimate the spatial and temporal dependence fully data-driven by penalizing a set of Yule-Walker equations. This regularization can be left unstructured, but we also propose customized shrinkage procedures when observations originate from spatial grids (e.g. satellite images). Finite sample error bounds are derived and estimation consistency is established in an asymptotic framework wherein the sample size and the number of spatial units diverge jointly. Exogenous variables can be included as well. A simulation exercise shows strong finite sample performance compared to competing procedures. As an empirical application, we model satellite measured NO2 concentrations in London. Our approach delivers forecast improvements over a competitive benchmark and we discover evidence for strong spatial interactions.
    Towards the global vision of engagement of Generation Z at the workplace: Mathematical modeling. (arXiv:2112.15401v1 [econ.GN])
    (2 min) Correlation and cluster analyses (k-Means, Gaussian Mixture Models) were performed on Generation Z engagement surveys at the workplace. The clustering indicates relations between various factors that describe the engagement of employees. The most noticeable factors are a clear statement about the responsibilities at work, and challenging work. These factors are essential in practice. The results of this paper can be used in preparing better motivational systems aimed at Generation Z employees.
    A theory of independent mechanisms for extrapolation in generative models. (arXiv:2004.00184v2 [cs.LG] UPDATED)
    (2 min) Generative models can be trained to emulate complex empirical data, but are they useful to make predictions in the context of previously unobserved environments? An intuitive idea to promote such extrapolation capabilities is to have the architecture of such model reflect a causal graph of the true data generating process, such that one can intervene on each node independently of the others. However, the nodes of this graph are usually unobserved, leading to overparameterization and lack of identifiability of the causal structure. We develop a theoretical framework to address this challenging situation by defining a weaker form of identifiability, based on the principle of independence of mechanisms. We demonstrate on toy examples that classical stochastic gradient descent can hinder the model's extrapolation capabilities, suggesting independence of mechanisms should be enforced explicitly during training. Experiments on deep generative models trained on real world data support these insights and illustrate how the extrapolation capabilities of such models can be leveraged.
    Score-Based Generative Modeling with Critically-Damped Langevin Diffusion. (arXiv:2112.07068v2 [stat.ML] UPDATED)
    (2 min) Score-based generative models (SGMs) have demonstrated remarkable synthesis quality. SGMs rely on a diffusion process that gradually perturbs the data towards a tractable distribution, while the generative model learns to denoise. The complexity of this denoising task is, apart from the data distribution itself, uniquely determined by the diffusion process. We argue that current SGMs employ overly simplistic diffusions, leading to unnecessarily complex denoising processes, which limit generative modeling performance. Based on connections to statistical mechanics, we propose a novel critically-damped Langevin diffusion (CLD) and show that CLD-based SGMs achieve superior performance. CLD can be interpreted as running a joint diffusion in an extended space, where the auxiliary variables can be considered "velocities" that are coupled to the data variables as in Hamiltonian dynamics. We derive a novel score matching objective for CLD and show that the model only needs to learn the score function of the conditional distribution of the velocity given data, an easier task than learning scores of the data directly. We also derive a new sampling scheme for efficient synthesis from CLD-based diffusion models. We find that CLD outperforms previous SGMs in synthesis quality for similar network architectures and sampling compute budgets. We show that our novel sampler for CLD significantly outperforms solvers such as Euler--Maruyama. Our framework provides new insights into score-based denoising diffusion models and can be readily used for high-resolution image synthesis. Project page and code: https://nv-tlabs.github.io/CLD-SGM.
    When are Iterative Gaussian Processes Reliably Accurate?. (arXiv:2112.15246v1 [cs.LG])
    (2 min) While recent work on conjugate gradient methods and Lanczos decompositions have achieved scalable Gaussian process inference with highly accurate point predictions, in several implementations these iterative methods appear to struggle with numerical instabilities in learning kernel hyperparameters, and poor test likelihoods. By investigating CG tolerance, preconditioner rank, and Lanczos decomposition rank, we provide a particularly simple prescription to correct these issues: we recommend that one should use a small CG tolerance ($\epsilon \leq 0.01$) and a large root decomposition size ($r \geq 5000$). Moreover, we show that L-BFGS-B is a compelling optimizer for Iterative GPs, achieving convergence with fewer gradient updates.
    High Dimensional Optimization through the Lens of Machine Learning. (arXiv:2112.15392v1 [math.OC])
    (2 min) This thesis reviews numerical optimization methods with machine learning problems in mind. Since machine learning models are highly parametrized, we focus on methods suited for high dimensional optimization. We build intuition on quadratic models to figure out which methods are suited for non-convex optimization, and develop convergence proofs on convex functions for this selection of methods. With this theoretical foundation for stochastic gradient descent and momentum methods, we try to explain why the methods used commonly in the machine learning field are so successful. Besides explaining successful heuristics, the last chapter also provides a less extensive review of more theoretical methods, which are not quite as popular in practice. So in some sense this work attempts to answer the question: Why are the default Tensorflow optimizers included in the defaults?
    Reward-Free Model-Based Reinforcement Learning with Linear Function Approximation. (arXiv:2110.06394v2 [cs.LG] UPDATED)
    (2 min) We study the model-based reward-free reinforcement learning with linear function approximation for episodic Markov decision processes (MDPs). In this setting, the agent works in two phases. In the exploration phase, the agent interacts with the environment and collects samples without the reward. In the planning phase, the agent is given a specific reward function and uses samples collected from the exploration phase to learn a good policy. We propose a new provably efficient algorithm, called UCRL-RFE under the Linear Mixture MDP assumption, where the transition probability kernel of the MDP can be parameterized by a linear function over certain feature mappings defined on the triplet of state, action, and next state. We show that to obtain an $\epsilon$-optimal policy for arbitrary reward function, UCRL-RFE needs to sample at most $\tilde{\mathcal{O}}(H^5d^2\epsilon^{-2})$ episodes during the exploration phase. Here, $H$ is the length of the episode, $d$ is the dimension of the feature mapping. We also propose a variant of UCRL-RFE using Bernstein-type bonus and show that it needs to sample at most $\tilde{\mathcal{O}}(H^4d(H + d)\epsilon^{-2})$ to achieve an $\epsilon$-optimal policy. By constructing a special class of linear Mixture MDPs, we also prove that for any reward-free algorithm, it needs to sample at least $\tilde \Omega(H^2d\epsilon^{-2})$ episodes to obtain an $\epsilon$-optimal policy. Our upper bound matches the lower bound in terms of the dependence on $\epsilon$ and the dependence on $d$ if $H \ge d$.
    Infinite wide (finite depth) Neural Networks benefit from multi-task learning unlike shallow Gaussian Processes -- an exact quantitative macroscopic characterization. (arXiv:2112.15577v1 [cs.LG])
    (2 min) We prove in this paper that wide ReLU neural networks (NNs) with at least one hidden layer optimized with l2-regularization on the parameters enforces multi-task learning due to representation-learning - also in the limit width to infinity. This is in contrast to multiple other idealized settings discussed in the literature where wide (ReLU)-NNs loose their ability to benefit from multi-task learning in the limit width to infinity. We deduce the multi-task learning ability from proving an exact quantitative macroscopic characterization of the learned NN in function space.
    Dynamic Causal Effects Evaluation in A/B Testing with a Reinforcement Learning Framework. (arXiv:2002.01711v5 [cs.LG] UPDATED)
    (2 min) A/B testing, or online experiment is a standard business strategy to compare a new product with an old one in pharmaceutical, technological, and traditional industries. Major challenges arise in online experiments of two-sided marketplace platforms (e.g., Uber) where there is only one unit that receives a sequence of treatments over time. In those experiments, the treatment at a given time impacts current outcome as well as future outcomes. The aim of this paper is to introduce a reinforcement learning framework for carrying A/B testing in these experiments, while characterizing the long-term treatment effects. Our proposed testing procedure allows for sequential monitoring and online updating. It is generally applicable to a variety of treatment designs in different industries. In addition, we systematically investigate the theoretical properties (e.g., size and power) of our testing procedure. Finally, we apply our framework to both simulated data and a real-world data example obtained from a technological company to illustrate its advantage over the current practice. A Python implementation of our test is available at https://github.com/callmespring/CausalRL.
    G-PATE: Scalable Differentially Private Data Generator via Private Aggregation of Teacher Discriminators. (arXiv:1906.09338v2 [cs.LG] UPDATED)
    (2 min) Recent advances in machine learning have largely benefited from the massive accessible training data. However, large-scale data sharing has raised great privacy concerns. In this work, we propose a novel privacy-preserving data Generative model based on the PATE framework (G-PATE), aiming to train a scalable differentially private data generator that preserves high generated data utility. Our approach leverages generative adversarial nets to generate data, combined with private aggregation among different discriminators to ensure strong privacy guarantees. Compared to existing approaches, G-PATE significantly improves the use of privacy budgets. In particular, we train a student data generator with an ensemble of teacher discriminators and propose a novel private gradient aggregation mechanism to ensure differential privacy on all information that flows from teacher discriminators to the student generator. In addition, with random projection and gradient discretization, the proposed gradient aggregation mechanism is able to effectively deal with high-dimensional gradient vectors. Theoretically, we prove that G-PATE ensures differential privacy for the data generator. Empirically, we demonstrate the superiority of G-PATE over prior work through extensive experiments. We show that G-PATE is the first work being able to generate high-dimensional image data with high data utility under limited privacy budgets ($\epsilon \le 1$). Our code is available at https://github.com/AI-secure/G-PATE.
    Modelling matrix time series via a tensor CP-decomposition. (arXiv:2112.15423v1 [stat.ME])
    (2 min) We propose to model matrix time series based on a tensor CP-decomposition. Instead of using an iterative algorithm which is the standard practice for estimating CP-decompositions, we propose a new and one-pass estimation procedure based on a generalized eigenanalysis constructed from the serial dependence structure of the underlying process. A key idea of the new procedure is to project a generalized eigenequation defined in terms of rank-reduced matrices to a lower-dimensional one with full-ranked matrices, to avoid the intricacy of the former of which the number of eigenvalues can be zero, finite and infinity. The asymptotic theory has been established under a general setting without the stationarity. It shows, for example, that all the component coefficient vectors in the CP-decomposition are estimated consistently with the different error rates, depending on the relative sizes between the dimensions of time series and the sample size. The proposed model and the estimation method are further illustrated with both simulated and real data; showing effective dimension-reduction in modelling and forecasting matrix time series.
    Efficient Robust Training via Backward Smoothing. (arXiv:2010.01278v2 [cs.LG] UPDATED)
    (2 min) Adversarial training is so far the most effective strategy in defending against adversarial examples. However, it suffers from high computational costs due to the iterative adversarial attacks in each training step. Recent studies show that it is possible to achieve fast Adversarial Training by performing a single-step attack with random initialization. However, such an approach still lags behind state-of-the-art adversarial training algorithms on both stability and model robustness. In this work, we develop a new understanding towards Fast Adversarial Training, by viewing random initialization as performing randomized smoothing for better optimization of the inner maximization problem. Following this new perspective, we also propose a new initialization strategy, backward smoothing, to further improve the stability and model robustness over single-step robust training methods. Experiments on multiple benchmarks demonstrate that our method achieves similar model robustness as the original TRADES method while using much less training time ($\sim$3x improvement with the same training schedule).
    Bayesian Algorithms Learn to Stabilize Unknown Continuous-Time Systems. (arXiv:2112.15094v1 [eess.SY])
    (2 min) Linear dynamical systems are canonical models for learning-based control of plants with uncertain dynamics. The setting consists of a stochastic differential equation that captures the state evolution of the plant understudy, while the true dynamics matrices are unknown and need to be learned from the observed data of state trajectory. An important issue is to ensure that the system is stabilized and destabilizing control actions due to model uncertainties are precluded as soon as possible. A reliable stabilization procedure for this purpose that can effectively learn from unstable data to stabilize the system in a finite time is not currently available. In this work, we propose a novel Bayesian learning algorithm that stabilizes unknown continuous-time stochastic linear systems. The presented algorithm is flexible and exposes effective stabilization performance after a remarkably short time period of interacting with the system.
    Decentralized Optimization Over the Stiefel Manifold by an Approximate Augmented Lagrangian Function. (arXiv:2112.14949v1 [math.OC])
    (2 min) In this paper, we focus on the decentralized optimization problem over the Stiefel manifold, which is defined on a connected network of $d$ agents. The objective is an average of $d$ local functions, and each function is privately held by an agent and encodes its data. The agents can only communicate with their neighbors in a collaborative effort to solve this problem. In existing methods, multiple rounds of communications are required to guarantee the convergence, giving rise to high communication costs. In contrast, this paper proposes a decentralized algorithm, called DESTINY, which only invokes a single round of communications per iteration. DESTINY combines gradient tracking techniques with a novel approximate augmented Lagrangian function. The global convergence to stationary points is rigorously established. Comprehensive numerical experiments demonstrate that DESTINY has a strong potential to deliver a cutting-edge performance in solving a variety of testing problems.
    Benign Overfitting in Adversarially Robust Linear Classification. (arXiv:2112.15250v1 [cs.LG])
    (2 min) "Benign overfitting", where classifiers memorize noisy training data yet still achieve a good generalization performance, has drawn great attention in the machine learning community. To explain this surprising phenomenon, a series of works have provided theoretical justification in over-parameterized linear regression, classification, and kernel methods. However, it is not clear if benign overfitting still occurs in the presence of adversarial examples, i.e., examples with tiny and intentional perturbations to fool the classifiers. In this paper, we show that benign overfitting indeed occurs in adversarial training, a principled approach to defend against adversarial examples. In detail, we prove the risk bounds of the adversarially trained linear classifier on the mixture of sub-Gaussian data under $\ell_p$ adversarial perturbations. Our result suggests that under moderate perturbations, adversarially trained linear classifiers can achieve the near-optimal standard and adversarial risks, despite overfitting the noisy training data. Numerical experiments validate our theoretical findings.
    Data-driven advice for interpreting local and global model predictions in bioinformatics problems. (arXiv:2108.06201v2 [stat.ML] UPDATED)
    (2 min) Tree-based algorithms such as random forests and gradient boosted trees continue to be among the most popular and powerful machine learning models used across multiple disciplines. The conventional wisdom of estimating the impact of a feature in tree based models is to measure the \textit{node-wise reduction of a loss function}, which (i) yields only global importance measures and (ii) is known to suffer from severe biases. Conditional feature contributions (CFCs) provide \textit{local}, case-by-case explanations of a prediction by following the decision path and attributing changes in the expected output of the model to each feature along the path. However, Lundberg et al. pointed out a potential bias of CFCs which depends on the distance from the root of a tree. The by now immensely popular alternative, SHapley Additive exPlanation (SHAP) values appear to mitigate this bias but are computationally much more expensive. Here we contribute a thorough comparison of the explanations computed by both methods on a set of 164 publicly available classification problems in order to provide data-driven algorithm recommendations to current researchers. For random forests, we find extremely high similarities and correlations of both local and global SHAP values and CFC scores, leading to very similar rankings and interpretations. Analogous conclusions hold for the fidelity of using global feature importance scores as a proxy for the predictive power associated with each feature.
    From Twitter to Traffic Predictor: Next-Day Morning Traffic Prediction Using Social Media Data. (arXiv:2009.13794v3 [cs.SI] UPDATED)
    (3 min) The effectiveness of traditional traffic prediction methods is often extremely limited when forecasting traffic dynamics in early morning. The reason is that traffic can break down drastically during the early morning commute, and the time and duration of this break-down vary substantially from day to day. Early morning traffic forecast is crucial to inform morning-commute traffic management, but they are generally challenging to predict in advance, particularly by midnight. In this paper, we propose to mine Twitter messages as a probing method to understand the impacts of people's work and rest patterns in the evening/midnight of the previous day to the next-day morning traffic. The model is tested on freeway networks in Pittsburgh as experiments. The resulting relationship is surprisingly simple and powerful. We find that, in general, the earlier people rest as indicated from Tweets, the more congested roads will be in the next morning. The occurrence of big events in the evening before, represented by higher or lower tweet sentiment than normal, often implies lower travel demand in the next morning than normal days. Besides, people's tweeting activities in the night before and early morning are statistically associated with congestion in morning peak hours. We make use of such relationships to build a predictive framework which forecasts morning commute congestion using people's tweeting profiles extracted by 5 am or as late as the midnight prior to the morning. The Pittsburgh study supports that our framework can precisely predict morning congestion, particularly for some road segments upstream of roadway bottlenecks with large day-to-day congestion variation. Our approach considerably outperforms those existing methods without Twitter message features, and it can learn meaningful representation of demand from tweeting profiles that offer managerial insights.
    Random cohort effects and age groups dependency structure for mortality modelling and forecasting: Mixed-effects time-series model approach. (arXiv:2112.15258v1 [stat.AP])
    (2 min) There have been significant efforts devoted to solving the longevity risk given that a continuous growth in population ageing has become a severe issue for many developed countries over the past few decades. The Cairns-Blake-Dowd (CBD) model, which incorporates cohort effects parameters in its parsimonious design, is one of the most well-known approaches for mortality modelling at higher ages and longevity risk. This article proposes a novel mixed-effects time-series approach for mortality modelling and forecasting with considerations of age groups dependence and random cohort effects parameters. The proposed model can disclose more mortality data information and provide a natural quantification of the model parameters uncertainties with no pre-specified constraint required for estimating the cohort effects parameters. The abilities of the proposed approach are demonstrated through two applications with empirical male and female mortality data. The proposed approach shows remarkable improvements in terms of forecast accuracy compared to the CBD model in the short-, mid-and long-term forecasting using mortality data of several developed countries in the numerical examples.
    Binary Diffing as a Network Alignment Problem via Belief Propagation. (arXiv:2112.15337v1 [cs.LG])
    (2 min) In this paper, we address the problem of finding a correspondence, or matching, between the functions of two programs in binary form, which is one of the most common task in binary diffing. We introduce a new formulation of this problem as a particular instance of a graph edit problem over the call graphs of the programs. In this formulation, the quality of a mapping is evaluated simultaneously with respect to both function content and call graph similarities. We show that this formulation is equivalent to a network alignment problem. We propose a solving strategy for this problem based on max-product belief propagation. Finally, we implement a prototype of our method, called QBinDiff, and propose an extensive evaluation which shows that our approach outperforms state of the art diffing tools.
    Multiple Testing and Variable Selection along the path of the Least Angle Regression. (arXiv:1906.12072v4 [math.ST] UPDATED)
    (3 min) We investigate multiple testing and variable selection using the Least Angle Regression (LARS) algorithm in high dimensions under the assumption of Gaussian noise. LARS is known to produce a piecewise affine solution path with change points referred to as the knots of the LARS path. The key to our results is an expression in closed form of the exact joint law of a $K$-tuple of knots conditional on the variables selected by LARS, namely the so-called post-selection joint law of the LARS knots. Numerical experiments demonstrate the perfect fit of our findings. This paper makes three main contributions. First, we build testing procedures on variables entering the model along the LARS path in the general design case when the noise level can be unknown. These testing procedures are referred to as the Generalized $t$-Spacing tests (GtSt) and we prove that they have an exact non-asymptotic level (i.e., the Type I error is exactly controlled). This extends work of (Taylor et al., 2014) where the spacing test works for consecutive knots and known variance. Second, we introduce a new exact multiple false negatives test after model selection in the general design case when the noise level may be unknown. We prove that this testing procedure has exact non-asymptotic level for general design and unknown noise level. Third, we give an exact control of the false discovery rate under orthogonal design assumption. Monte Carlo simulations and a real data experiment are provided to illustrate our results in this case. Of independent interest, we introduce an equivalent formulation of the LARS algorithm based on a recursive function.
    Triangular Flows for Generative Modeling: Statistical Consistency, Smoothness Classes, and Fast Rates. (arXiv:2112.15595v1 [stat.ML])
    (2 min) Triangular flows, also known as Kn\"{o}the-Rosenblatt measure couplings, comprise an important building block of normalizing flow models for generative modeling and density estimation, including popular autoregressive flow models such as real-valued non-volume preserving transformation models (Real NVP). We present statistical guarantees and sample complexity bounds for triangular flow statistical models. In particular, we establish the statistical consistency and the finite sample convergence rates of the Kullback-Leibler estimator of the Kn\"{o}the-Rosenblatt measure coupling using tools from empirical process theory. Our results highlight the anisotropic geometry of function classes at play in triangular flows, shed light on optimal coordinate ordering, and lead to statistical guarantees for Jacobian flows. We conduct numerical experiments on synthetic data to illustrate the practical implications of our theoretical findings.
    Non Asymptotic Bounds for Optimization via Online Multiplicative Stochastic Gradient Descent. (arXiv:2112.07110v4 [stat.ML] UPDATED)
    (2 min) The gradient noise of Stochastic Gradient Descent (SGD) is considered to play a key role in its properties (e.g. escaping low potential points and regularization). Past research has indicated that the covariance of the SGD error done via minibatching plays a critical role in determining its regularization and escape from low potential points. It is however not much explored how much the distribution of the error influences the behavior of the algorithm. Motivated by some new research in this area, we prove universality results by showing that noise classes that have the same mean and covariance structure of SGD via minibatching have similar properties. We mainly consider the Multiplicative Stochastic Gradient Descent (M-SGD) algorithm as introduced by Wu et al., which has a much more general noise class than the SGD algorithm done via minibatching. We establish nonasymptotic bounds for the M-SGD algorithm mainly with respect to the Stochastic Differential Equation corresponding to SGD via minibatching. We also show that the M-SGD error is approximately a scaled Gaussian distribution with mean $0$ at any fixed point of the M-SGD algorithm. We also establish bounds for the convergence of the M-SGD algorithm in the strongly convex regime.
    Sufficient Statistic Memory AMP. (arXiv:2112.15327v1 [cs.IT])
    (2 min) Approximate message passing (AMP) is a promising technique for unknown signal reconstruction of certain high-dimensional linear systems with non-Gaussian signaling. A distinguished feature of the AMP-type algorithms is that their dynamics can be rigorously described by state evolution. However, state evolution does not necessarily guarantee the convergence of iterative algorithms. To solve the convergence problem of AMP-type algorithms in principle, this paper proposes a memory AMP (MAMP) under a sufficient statistic condition, named sufficient statistic MAMP (SS-MAMP). We show that the covariance matrices of SS-MAMP are L-banded and convergent. Given an arbitrary MAMP, we can construct an SS-MAMP by damping, which not only ensures the convergence of MAMP but also preserves the orthogonality of MAMP, i.e., its dynamics can be rigorously described by state evolution. As a byproduct, we prove that the Bayes-optimal orthogonal/vector AMP (BO-OAMP/VAMP) is an SS-MAMP. As a result, we reveal two interesting properties of BO-OAMP/VAMP for large systems: 1) the covariance matrices are L-banded and are convergent in BO-OAMP/VAMP, and 2) damping and memory are useless (i.e., do not bring performance improvement) in BO-OAMP/VAMP. As an example, we construct a sufficient statistic Bayes-optimal MAMP (BO-MAMP), which is Bayes optimal if its state evolution has a unique fixed point and its MSE is not worse than the original BO-MAMP. Finally, simulations are provided to verify the validity and accuracy of the theoretical results.
    The SAMME.C2 algorithm for severely imbalanced multi-class classification. (arXiv:2112.14868v1 [stat.ML])
    (2 min) Classification predictive modeling involves the accurate assignment of observations in a dataset to target classes or categories. There is an increasing growth of real-world classification problems with severely imbalanced class distributions. In this case, minority classes have much fewer observations to learn from than those from majority classes. Despite this sparsity, a minority class is often considered the more interesting class yet developing a scientific learning algorithm suitable for the observations presents countless challenges. In this article, we suggest a novel multi-class classification algorithm specialized to handle severely imbalanced classes based on the method we refer to as SAMME.C2. It blends the flexible mechanics of the boosting techniques from SAMME algorithm, a multi-class classifier, and Ada.C2 algorithm, a cost-sensitive binary classifier designed to address highly class imbalances. Not only do we provide the resulting algorithm but we also establish scientific and statistical formulation of our proposed SAMME.C2 algorithm. Through numerical experiments examining various degrees of classifier difficulty, we demonstrate consistent superior performance of our proposed model.
    How Much Over-parameterization Is Sufficient to Learn Deep ReLU Networks?. (arXiv:1911.12360v4 [cs.LG] UPDATED)
    (2 min) A recent line of research on deep learning focuses on the extremely over-parameterized setting, and shows that when the network width is larger than a high degree polynomial of the training sample size $n$ and the inverse of the target error $\epsilon^{-1}$, deep neural networks learned by (stochastic) gradient descent enjoy nice optimization and generalization guarantees. Very recently, it is shown that under certain margin assumptions on the training data, a polylogarithmic width condition suffices for two-layer ReLU networks to converge and generalize (Ji and Telgarsky, 2019). However, whether deep neural networks can be learned with such a mild over-parameterization is still an open question. In this work, we answer this question affirmatively and establish sharper learning guarantees for deep ReLU networks trained by (stochastic) gradient descent. In specific, under certain assumptions made in previous work, our optimization and generalization guarantees hold with network width polylogarithmic in $n$ and $\epsilon^{-1}$. Our results push the study of over-parameterized deep neural networks towards more practical settings.
    Studying the Interplay between Information Loss and Operation Loss in Representations for Classification. (arXiv:2112.15238v1 [cs.LG])
    (2 min) Information-theoretic measures have been widely adopted in the design of features for learning and decision problems. Inspired by this, we look at the relationship between i) a weak form of information loss in the Shannon sense and ii) the operation loss in the minimum probability of error (MPE) sense when considering a family of lossy continuous representations (features) of a continuous observation. We present several results that shed light on this interplay. Our first result offers a lower bound on a weak form of information loss as a function of its respective operation loss when adopting a discrete lossy representation (quantization) instead of the original raw observation. From this, our main result shows that a specific form of vanishing information loss (a weak notion of asymptotic informational sufficiency) implies a vanishing MPE loss (or asymptotic operational sufficiency) when considering a general family of lossy continuous representations. Our theoretical findings support the observation that the selection of feature representations that attempt to capture informational sufficiency is appropriate for learning, but this selection is a rather conservative design principle if the intended goal is achieving MPE in classification. Supporting this last point, and under some structural conditions, we show that it is possible to adopt an alternative notion of informational sufficiency (strictly weaker than pure sufficiency in the mutual information sense) to achieve operational sufficiency in learning.
    Neural Thompson Sampling. (arXiv:2010.00827v2 [cs.LG] UPDATED)
    (2 min) Thompson Sampling (TS) is one of the most effective algorithms for solving contextual multi-armed bandit problems. In this paper, we propose a new algorithm, called Neural Thompson Sampling, which adapts deep neural networks for both exploration and exploitation. At the core of our algorithm is a novel posterior distribution of the reward, where its mean is the neural network approximator, and its variance is built upon the neural tangent features of the corresponding neural network. We prove that, provided the underlying reward function is bounded, the proposed algorithm is guaranteed to achieve a cumulative regret of $\mathcal{O}(T^{1/2})$, which matches the regret of other contextual bandit algorithms in terms of total round number $T$. Experimental comparisons with other benchmark bandit algorithms on various data sets corroborate our theory.
    Optimal Model Averaging of Support Vector Machines in Diverging Model Spaces. (arXiv:2112.12961v2 [stat.ML] UPDATED)
    (2 min) Support vector machine (SVM) is a powerful classification method that has achieved great success in many fields. Since its performance can be seriously impaired by redundant covariates, model selection techniques are widely used for SVM with high dimensional covariates. As an alternative to model selection, significant progress has been made in the area of model averaging in the past decades. Yet no frequentist model averaging method was considered for SVM. This work aims to fill the gap and to propose a frequentist model averaging procedure for SVM which selects the optimal weight by cross validation. Even when the number of covariates diverges at an exponential rate of the sample size, we show asymptotic optimality of the proposed method in the sense that the ratio of its hinge loss to the lowest possible loss converges to one. We also derive the convergence rate which provides more insights to model averaging. Compared to model selection methods of SVM which require a tedious but critical task of tuning parameter selection, the model averaging method avoids the task and shows promising performances in the empirical studies.
    Hierarchical forecasting with a top-down alignment of independent level forecasts. (arXiv:2103.08250v4 [stat.ML] UPDATED)
    (2 min) Hierarchical forecasting with intermittent time series is a challenge in both research and empirical studies. Extensive research focuses on improving the accuracy of each hierarchy, especially the intermittent time series at bottom levels. Then hierarchical reconciliation could be used to improve the overall performance further. In this paper, we present a \emph{hierarchical-forecasting-with-alignment} approach that treats the bottom level forecasts as mutable to ensure higher forecasting accuracy on the upper levels of the hierarchy. We employ a pure deep learning forecasting approach N-BEATS for continuous time series at the top levels and a widely used tree-based algorithm LightGBM for the intermittent time series at the bottom level. The \emph{hierarchical-forecasting-with-alignment} approach is a simple yet effective variant of the bottom-up method, accounting for biases that are difficult to observe at the bottom level. It allows suboptimal forecasts at the lower level to retain a higher overall performance. The approach in this empirical study was developed by the first author during the M5 Forecasting Accuracy competition, ranking second place. The method is also business orientated and could benefit for business strategic planning.
    Separation of scales and a thermodynamic description of feature learning in some CNNs. (arXiv:2112.15383v1 [stat.ML])
    (2 min) Deep neural networks (DNNs) are powerful tools for compressing and distilling information. Due to their scale and complexity, often involving billions of inter-dependent internal degrees of freedom, exact analysis approaches often fall short. A common strategy in such cases is to identify slow degrees of freedom that average out the erratic behavior of the underlying fast microscopic variables. Here, we identify such a separation of scales occurring in over-parameterized deep convolutional neural networks (CNNs) at the end of training. It implies that neuron pre-activations fluctuate in a nearly Gaussian manner with a deterministic latent kernel. While for CNNs with infinitely many channels these kernels are inert, for finite CNNs they adapt and learn from data in an analytically tractable manner. The resulting thermodynamic theory of deep learning yields accurate predictions on several deep non-linear CNN toy models. In addition, it provides new ways of analyzing and understanding CNNs.
    Improved Algorithm for the Network Alignment Problem with Application to Binary Diffing. (arXiv:2112.15336v1 [cs.LG])
    (2 min) In this paper, we present a novel algorithm to address the Network Alignment problem. It is inspired from a previous message passing framework of Bayati et al. [2] and includes several modifications designed to significantly speed up the message updates as well as to enforce their convergence. Experiments show that our proposed model outperforms other state-of-the-art solvers. Finally, we propose an application of our method in order to address the Binary Diffing problem. We show that our solution provides better assignment than the reference differs in almost all submitted instances and outline the importance of leveraging the graphical structure of binary programs.
    Bayesian Optimization of Function Networks. (arXiv:2112.15311v1 [cs.LG])
    (2 min) We consider Bayesian optimization of the output of a network of functions, where each function takes as input the output of its parent nodes, and where the network takes significant time to evaluate. Such problems arise, for example, in reinforcement learning, engineering design, and manufacturing. While the standard Bayesian optimization approach observes only the final output, our approach delivers greater query efficiency by leveraging information that the former ignores: intermediate output within the network. This is achieved by modeling the nodes of the network using Gaussian processes and choosing the points to evaluate using, as our acquisition function, the expected improvement computed with respect to the implied posterior on the objective. Although the non-Gaussian nature of this posterior prevents computing our acquisition function in closed form, we show that it can be efficiently maximized via sample average approximation. In addition, we prove that our method is asymptotically consistent, meaning that it finds a globally optimal solution as the number of evaluations grows to infinity, thus generalizing previously known convergence results for the expected improvement. Notably, this holds even though our method might not evaluate the domain densely, instead leveraging problem structure to leave regions unexplored. Finally, we show that our approach dramatically outperforms standard Bayesian optimization methods in several synthetic and real-world problems.
    A Unified and Constructive Framework for the Universality of Neural Networks. (arXiv:2112.14877v1 [cs.LG])
    (2 min) One of the reasons that many neural networks are capable of replicating complicated tasks or functions is their universality property. The past few decades have seen many attempts in providing constructive proofs for single or class of neural networks. This paper is an effort to provide a unified and constructive framework for the universality of a large class of activations including most of existing activations and beyond. At the heart of the framework is the concept of neural network approximate identity. It turns out that most of existing activations are neural network approximate identity, and thus universal in the space of continuous of functions on compacta. The framework induces several advantages. First, it is constructive with elementary means from functional analysis, probability theory, and numerical analysis. Second, it is the first unified attempt that is valid for most of existing activations. Third, as a by product, the framework provides the first university proof for some of the existing activation functions including Mish, SiLU, ELU, GELU, and etc. Fourth, it discovers new activations with guaranteed universality property. Indeed, any activation\textemdash whose $\k$th derivative, with $\k$ being an integer, is integrable and essentially bounded\textemdash is universal. Fifth, for a given activation and error tolerance, the framework provides precisely the architecture of the corresponding one-hidden neural network with predetermined number of neuron, and the values of weights/biases.
    A sampling-based approach for efficient clustering in large datasets. (arXiv:2112.14793v1 [cs.LG])
    (2 min) We propose a simple and efficient clustering method for high-dimensional data with a large number of clusters. Our algorithm achieves high-performance by evaluating distances of datapoints with a subset of the cluster centres. Our contribution is substantially more efficient than k-means as it does not require an all to all comparison of data points and clusters. We show that the optimal solutions of our approximation are the same as in the exact solution. However, our approach is considerably more efficient at extracting these clusters compared to the state-of-the-art. We compare our approximation with the exact k-means and alternative approximation approaches on a series of standardised clustering tasks. For the evaluation, we consider the algorithmic complexity, including number of operations to convergence, and the stability of the results.
    Entropy Regularized Optimal Transport Independence Criterion. (arXiv:2112.15265v1 [stat.ML])
    (2 min) Optimal transport (OT) and its entropy regularized offspring have recently gained a lot of attention in both machine learning and AI domains. In particular, optimal transport has been used to develop probability metrics between probability distributions. We introduce in this paper an independence criterion based on entropy regularized optimal transport. Our criterion can be used to test for independence between two samples. We establish non-asymptotic bounds for our test statistic, and study its statistical behavior under both the null and alternative hypothesis. Our theoretical results involve tools from U-process theory and optimal transport theory. We present experimental results on existing benchmarks, illustrating the interest of the proposed criterion.
    Time varying regression with hidden linear dynamics. (arXiv:2112.14862v1 [math.ST])
    (2 min) We revisit a model for time-varying linear regression that assumes the unknown parameters evolve according to a linear dynamical system. Counterintuitively, we show that when the underlying dynamics are stable the parameters of this model can be estimated from data by combining just two ordinary least squares estimates. We offer a finite sample guarantee on the estimation error of our method and discuss certain advantages it has over Expectation-Maximization (EM), which is the main approach proposed by prior work.
  • Data Science Central

    Data Monetization Approach for B2B2C Industries
    (6 min) Most of the Big Data, Data Science and AI / ML success stories seem to revolve around Business-to-Consumer (B2C) industries.  The companies that we see leading the economic exploitation of data and analytics are primarily B2C companies such as Apple, Alphabet (Google), Microsoft, Amazon, and Facebook (Figure 1).…
    IoT Drives Growth of Intelligent Transportation Systems
    (3 min) The global IoT in intelligent transportation systems market is anticipated to be driven by the rising focus of industry players on research and development, aimed at enhancing the integrated IoT software in order to minimize the cost of operation associated with these tools. Furthermore, the increasing focus on data collection pertaining to…
  • AWS Machine Learning Blog

    Identity verification using Amazon Rekognition
    (9 min) In-person user identity verification is slow to scale, costly, and high friction for users. Machine learning (ML) powered facial recognition technology can enable online user identity verification. Amazon Rekognition offers pre-trained facial recognition capabilities that you can quickly add to your user onboarding and authentication workflows to verify opted-in users’ identities online. No ML expertise […]
  • Jay Alammar

    The Illustrated Retrieval Transformer
    (5 min) Discussion: Discussion Thread for comments, corrections, or any feedback. Summary: The latest batch of language models can be much smaller yet achieve GPT-3 like performance by being able to query a database or search the web for information. A key indication is that building larger and larger models is not the only way to improve performance. The last few years saw the rise of Large Language Models (LLMs) – machine learning models that rapidly improve how machines process and generate language. Some of the highlights since 2017 include: The original Transformer breaks previous performance records for machine translation. BERT popularizes the pre-training then finetuning process, as well as Transformer-based contextualized word embeddings. It then rapidly starts to power Google Search and Bing Search. GPT-2 demonstrates the machine’s ability to write as well as humans do. First T5, then T0 push the boundaries of transfer learning (training a model on one task, and then having it do well on other adjacent tasks) and posing a lot of different tasks as text-to-text tasks. GPT-3 showed that massive scaling of generative models can lead to shocking emergent applications (the industry continues to train larger models like Gopher, MT-NLG…etc). For a while, it seemed like scaling larger and larger models is the main way to improve performance. Recent developments in the field, like DeepMind’s RETRO Transformer and OpenAI’s WebGPT, reverse this trend by showing that smaller generative language models can perform on par with massive models if we augment them with a way to search/query for information. This article breaks down DeepMind’s RETRO (Retrieval-Enhanced TRansfOrmer) and how it works. The model performs on par with GPT-3 despite being 4% its size (7.5 billion parameters vs. 185 billion for GPT-3 Da Vinci). RETRO incorporates information retrieved from a database to free its parameters from being an expensive store of facts and world knowledge. RETRO was presented in the paper Improving Language Models by Retrieving from Trillions of Tokens. It continues and builds on a wide variety of retrieval work in the research community. This article explains the model and not what is especially novel about it.
  • Artificial Intelligence

    Amazon Research Introduces Deep Reinforcement Learning For NLU Ranking Tasks
    (1 min) In recent years, voice-based virtual assistants such as Google Assistant and Amazon Alexa have grown popular. This has presented both potential and challenges for natural language understanding (NLU) systems. These devices’ production systems are often trained by supervised learning and rely significantly on annotated data. But, data annotation is costly and time-consuming. Furthermore, model updates using offline supervised learning can take long and miss trending requests. In the underlying architecture of voice-based virtual assistants, the NLU model often categorizes user requests into hypotheses for downstream applications to fulfill. A hypothesis comprises two tags: user intention (intent) and Named Entity Recognition (NER). For example, the valid hypothesis for “play a Madonna song” will be: PlaySong intent, ArtistName – Madonna. A new Amazon research introduces deep reinforcement learning strategies for NLU ranking. Their work analyses a ranking question in an NLU system in which entirely independent domain experts generate hypotheses with their features, where a domain is a functional area such as Music, Shopping, or Weather. These hypotheses are then ranked based on their scores, calculated based on their characteristics. As a result, the ranker must calibrate features from domain experts and select one hypothesis according to policy. Continue Reading Research Paper: https://assets.amazon.science/b3/74/77ff47044b69820c466f0624a0ab/introducing-deep-reinforcement-learning-to-nlu-ranking-tasks.pdf ​ https://preview.redd.it/27wx7281ui981.png?width=1920&format=png&auto=webp&s=61264372ce2854031e4acd1221c802a3e042a0a8 submitted by /u/ai-lover [link] [comments]
    AI Will Make It Difficult to Discern Information from Misinformation (1-minute audio clip from Eric Schmidt, former Google CEO)
    (1 min) submitted by /u/frog9913 [link] [comments]
    NLP: Hybridization of statistical approach and expert system?
    (1 min) Hi everyone! I have a question for you. For context, we aggregate on a platform the various AI APIs on the market (GCP, Azure, etc.) and including NLP APIs (keyword extraction, sentiment analysis, NER, etc.). The idea is that a developer doesn't have to create accounts with different providers and can have them all on one API to test, compare and change whenever he wants. However, many customers ask us how to mix the "statistical" approach behind these APIs with expert systems and how to achieve hybridization. Do you have any idea how to do this? Thanks, submitted by /u/tah_zem [link] [comments]
    Top 10 Object Detection APIs
    (0 min) submitted by /u/tah_zem [link] [comments]
    Open Domain Question Answering Part-1 [BlenderBot 2.0]
    (0 min) submitted by /u/coffeeroach [link] [comments]
  • Machine Learning

    [R] 🐸YourTTS: Towards Zero-Shot Multi-Speaker TTS and Zero-Shot Voice Conversion for everyone
    (1 min) YourTTS brings the power of a multilingual approach to the task of zero-shot multi-speaker TTS... it is possible to fine-tune the model with less than 1 minute of speech and achieve state-of-the-art results in voice similarity and with reasonable quality. 🤖 Demo: https://coqui.ai 👩‍💻 Code: https://github.com/coqui-ai/tts 🚀 Blogpost: https://coqui.ai/blog/tts/yourtts-zero-shot-text-synthesis-low-resource-languages 📎 Paper: https://arxiv.org/abs/2112.02418 submitted by /u/josh-r-meyer [link] [comments]
    [D] "why academia tends to under-invest in engineering infrastructure?"
    (1 min) Tweet from @jackclarkSF asks an interesting question: Is there a good paper that explains how/why academia tends to under-invest in engineering infrastructure? https://twitter.com/jackclarkSF/status/1478077579110207489 submitted by /u/MassivePellfish [link] [comments]
    [D] How to measure accuracy of kNN Imputation?
    (1 min) I have a dataset in which the the best way to impute missing values is to use kNN but before I go ahead and do that I'd like to check what kind of accuracy I have with that form of imputation in this specific dataset and which k should be used. My original solution was as follows: From my original dataset, remove all rows with missing values From this dataset, impute NaNs randomly throughout the dataset with the same frequency that they were missing originally and store the values that were replaced with NaN in a new dataset as the ground truth Impute using kNN Check the accuracy of the imputed values against the ground truth values stored in step 2 using MAE for different k values Is there an easier way to do this? If not, should I be using MAE or another accuracy score? submitted by /u/Ok-Culture-9123 [link] [comments]
    [D] Paper that mathematically proves that gradient descent can achieve zero training error.
    (2 min) I think this is a well-known paper, but I have not been able to find it. I am interested in the paper that mathematically proves neural nets can fit any set of datapoints. So far what I have found mostly is papers that show empirically that or something related to that, like this one. I'd appreciate any help. Edit: u/the_new_scientist shared this paper which is what I was looking for: https://arxiv.org/pdf/1810.02054.pdf Also, I apologize for my vague description. Now that the paper is shown, I hope it is more clear to future readers what kind of results I meant, but in case that is not case, I was wondering about this question: under what conditions can a neural network achieve zero training error? and in particular, I am interested in papers with mathematical (even without empirical) results. submitted by /u/carlml [link] [comments]
    [D] NLP: Hybridization of statistical approach and expert system ?
    (1 min) Hi everyone! I have a question for you. For context, we aggregate on a platform the various AI APIs on the market (GCP, Azure, etc.) and including NLP APIs (keyword extraction, sentiment analysis, NER, etc.). The idea is that a developer doesn't have to create accounts with different providers and can have them all on one API to test, compare and change whenever he wants. However, many customers ask us how to mix the "statistical" approach behind these APIs with expert systems and how to achieve hybridization. Do you have any idea how to do this? Thanks, submitted by /u/tah_zem [link] [comments]
    [P] Bringing serverless to ML - stateful, arbitrary dependency, serverless for ML
    (1 min) Serverless infrastructure is yet practical to use for ML but we think it could bring lots of benefits. So, with a friend, we decided to make serverless easy for ML and we are building a platform to solve the main issues we find in serverless for ML: - Stateful: We don´t want to reload a whole model every time a user calls model.predict - Arbritary dependecies: Normal python code with any package dependencies you use and love, just many many times in parallel - Scale-up and scale-down: scale up with ease and auto shutdown to keep resources consumptionYou can visit our webpage, try the demo, and request early access to use our platform! Webpage: https://telekinesis.cloud Happy to receive questions and comments on what we are building! submitted by /u/snuns90 [link] [comments]
    [D] What are your hopes for Machine Learning in 2022?
    (1 min) Hi r/MachineLearning! I was just wondering what some of you are hoping ML can accomplish or overcome in this new year - interested in hearing your thoughts! submitted by /u/DataGeek0 [link] [comments]
    [R] The Illustrated Retrieval Transformer (GPT3 performance at 4% the size)
    (4 min) Hi r/MachineLearning, I spent some time wrapping my head around DeepMind's Retro Transformer and visualizing how it works. Hope you find it useful. All feedback is welcome! http://jalammar.github.io/illustrated-retrieval-transformer/ submitted by /u/jayalammar [link] [comments]
    [D] What causes feature collapse?
    (1 min) For those of you unfamiliar feature collapse is when you train a model for classification and the model ends up mapping out-of-distribution data or data of different classes in very close proximity in multi-dimensional space. So for example once your model learns a cluster so to speak for cat, during test it projects a dog into the center of that cluster and classifies it as cat. Some ways to sort of deal with this in CV is double gradient penalty and spectral norm of resnet blocks, but what causes feature collapse? submitted by /u/DolantheMFWizard [link] [comments]
    [R] New paper: "A relational Tsetlin machine with applications to natural language understanding"
    (2 min) ​ Relational Tsetlin Machine The paper introduces the first Relational #TsetlinMachine, which reasons with relations, variables, and constants. The approach is based on first-order logic and Herbrand semantics, taking the first steps toward the computing power of a universal Turing machine. The approach can take advantage of logical structures appearing in natural language, to learn rules that represent how actions and consequences are related in the real world. The outcome is a logic program of Horn clauses, bringing in a structured view of unstructured data. In closed-domain question-answering, the first-order representation produces 10× more compact knowledge bases, along with an increase in answering accuracy from 94.83% to 99.48%. The approach is further robust towards erroneous, missing, and superfluous information, distilling the aspects of a text that are important for real-world understanding. https://link.springer.com/article/10.1007/s10844-021-00682-5 #ML #AI #NLP #MachineLearning #Logic #Relational submitted by /u/olegranmo [link] [comments]
    [D] How to deal with huge Categorical data
    (2 min) I have a dataset that already contains about 55 columns and out of this, around 10 Columns or so have categorical data in it. If I were to OneHotEncode them, I will end up having a column count of more than 300. Is this something advisable? How do you people deal with such huge number of columns? I mean 300 columns is not a big deal, but I would like to know your opinion and thoughts on this. submitted by /u/CaterpillarPrevious2 [link] [comments]
    [D] Spotted this post in LessWrong. Can anyone verify the rather fantastic claims being made here?
    (4 min) https://www.lesswrong.com/posts/rCP5iTYLtfcoC8NXd/self-organised-neural-networks-a-simple-natural-and#Roadmap The writing has some red flags, but it looks interesting enough. Having some trouble with my gpu drivers so I can't run it right now. submitted by /u/Their_bad_spellers [link] [comments]
    [R] A Neural Network Solves and Generates Mathematics Problems by Program Synthesis: Calculus, Differential Equations, Linear Algebra, and More
    (0 min) submitted by /u/shitboots [link] [comments]
    [D] Is there flow-based method which treats input data as different lengths each?
    (1 min) Hello. I am searching the researches that different size of data are generated through the flow-based network, not super-resolution task such as continuous mapping. I want to generate output as time-aligned scalar data, for example ​ Input: noise sampling (B x T x C) Output: scalar data (B x T, C=1) ​ with introducing the variational data augmentation technique (in vFlow, which can output high-dimensionality as concatenate noise vector for input and output both) for output. But there's a problem time dimension T is different for each of all data input. How can I treat this problem? ​ p.s. I am very appreciate if I can read the flow-based research in NLP task. submitted by /u/RedCuraceo [link] [comments]
    [D] Anyone switched from vision to robotics?
    (1 min) I’m about to finish my PhD and the whole field of robotics looks so exciting right now, especially applications like farming and recycling. Has anyone switched from more pure deep learning (vision / NLP) to robotics and how did it happen? Did you just get a robotics related job focusing on the vision side of things or is it key to have more experience on the robotics side before getting a job? Also I’m curious what’s the best location for robotics? Like how you go to Hong Kong / New York for finance, SF for software or Shenzhen for hardware. submitted by /u/temporary_ml_guy [link] [comments]
    [P] I like YOLOv5 but the code complexity is...
    (2 min) I like YOLOv5 but the code complexity is... I can't deny that YOLOv5 is a practical open-source object detection pipeline. However, the pain begins when adding new features or new experimental methods. Code dependencies are hard to follow which makes the code difficult to maintain. We wanted to try various experimental methods but hate to write one-time code that is never re-used. So we worked on making an object detection pipeline to have a better code structure so that we could continuously improve and add new features while easy to maintain. https://github.com/j-marple-dev/AYolov2 And we applied CI(Formating, Linting, Unittest) to ensure code quality with Docker support for development and inference. Our Docker supports the development environment with VIM. Our code design from the…
    [D] GUI-based Machine Learning applications?
    (1 min) I was previously using Azure Machine Learning Studio(classic), and of course, it was discontinued last month. Any other free ML applications? The new Azure Machine Learning Studio isn't free, and this is a school project so I'm aiming for free and simple. Any suggestions? Or maybe someone else is using Studio(classic) and knows a way around this? submitted by /u/max02c [link] [comments]
  • Neural Networks, Deep Learning and Machine Learning

    Interested in learning but not really sure where to start
    (1 min) For some time now I've been reading about neural networks and ai, the topic has really interested me a lot and I have seen amazing GAN projects like artbreeder, this has motivated me to want to work with this kind of technology, the only problem is that I can't really figure out where to start, so till this moment the only thing i've done is a Python introductory course. What should or could I do next? P.S: Spanish is my native language so excuse me if my english is a little rusty. submitted by /u/luisaalberto [link] [comments]
  • Reinforcement Learning

    The Best Machine Learning Courses on Udemy (2022)
    (0 min) submitted by /u/JanPrince002 [link] [comments]
    Amazon Research Introduces Deep Reinforcement Learning For NLU Ranking Tasks
    (1 min) In recent years, voice-based virtual assistants such as Google Assistant and Amazon Alexa have grown popular. This has presented both potential and challenges for natural language understanding (NLU) systems. These devices’ production systems are often trained by supervised learning and rely significantly on annotated data. But, data annotation is costly and time-consuming. Furthermore, model updates using offline supervised learning can take long and miss trending requests. In the underlying architecture of voice-based virtual assistants, the NLU model often categorizes user requests into hypotheses for downstream applications to fulfill. A hypothesis comprises two tags: user intention (intent) and Named Entity Recognition (NER). For example, the valid hypothesis for “play a Madonna song” will be: PlaySong intent, ArtistName – Madonna. A new Amazon research introduces deep reinforcement learning strategies for NLU ranking. Their work analyses a ranking question in an NLU system in which entirely independent domain experts generate hypotheses with their features, where a domain is a functional area such as Music, Shopping, or Weather. These hypotheses are then ranked based on their scores, calculated based on their characteristics. As a result, the ranker must calibrate features from domain experts and select one hypothesis according to policy. Continue Reading Research Paper: https://assets.amazon.science/b3/74/77ff47044b69820c466f0624a0ab/introducing-deep-reinforcement-learning-to-nlu-ranking-tasks.pdf ​ https://preview.redd.it/fp79wa92ui981.png?width=1920&format=png&auto=webp&s=647db1d2bd5be9dc9698c980e2146fef499368f3 submitted by /u/ai-lover [link] [comments]
    Average reward formulation for continuing settings
    (2 min) I have a problem that is -for its nature- continuing, i.e. there are no episodes and the agent has to operate for an infinite time. In my simulator, however, I have the possibility of simulating episodes of any length. Is the average reward formulation appropriate for such a problem? I do not know much about it and I have a few questions for you: Could you point me to some literature that addresses average reward and its pros/cons vs discounted reward for continuing problems? Is the average taken on a window of finite size or on the limit for the size that goes to infinity? How do common RL algorithms work in this formulation? Can we still adapt them to use average reward? submitted by /u/fedetask [link] [comments]
    Formula to compute loss in A3C
    (1 min) I'm a beginner to RL and I'm trying to understand how the loss function was computed. If it follows a specific formular. I've read the a3c algorithm overview on paper by barto but it seems the implemtation here https://github.com/MorvanZhou/pytorch-A3C/blob/master/discrete_A3C.py is different. submitted by /u/phissy08 [link] [comments]
    Introduction to Markov Decision Processes - Draft Chapters 2 and 3
    (1 min) submitted by /u/YouAgainShmidhoobuh [link] [comments]
    DQN with online learning
    (2 min) I saw another post on here that mentioned how DQN's can be implemented via online learning, offline learning (e.g. batching), or a combination (offline learning first and then online learning in production). What I'm confused at is the actual deployment paradigm to facilitate online learning (either as the first step or with hybrid). Is the full training code (with a checkpoint model if hybrid) and agent deployed to production with the training step logic executed on-demand for each state transition? If so, would this mean that model compilation / quantization is not possible (e.g. Onnx runtime to INT8 precision)? How is it scaled if it's having to perform SGD and back-propagation with each state transition? ​ One the other side, if the model is compiled / quantized, how does it implement online learning (if this is even possible)? submitted by /u/FinateAI [link] [comments]
    Any good papers for non-stationary environment?
    (1 min) I want to create a control system for control agents(players). The players will move non-stationary, so is there any good technique or papers to train this control system? Very Thanks. submitted by /u/Spiritual_Fig3632 [link] [comments]
    How to change MDP into POMDP? (getting pixel observations)
    (1 min) Hi, I'm looking into OpenAI robogym (https://github.com/openai/robogym)'s rearrange environments, and want to try RL-based control on it with pixel observations. The default observations are not pixels, and I can't seem to find any documentation online on how to change these environments into pixel observations. Can anyone point me to resources where similar changes have been done? Are there preexisting wrappers or code for similar mujoco environments anywhere? My other option is to simply render the environment with mode rgb_array, and then train the algorithm based on these rendered observations. Is this a viable option? submitted by /u/junkQn [link] [comments]
    Discrete Actions (OpenAI Hide & Seek)
    (1 min) Hello :), Happy new year to all of you! I have a question regarding the action space for the Hide & Seek paper by OpenAI. The architecture of the learning model they used is below. https://preview.redd.it/el54yihfjd981.png?width=794&format=png&auto=webp&s=33c5228ac8003b9bb89b6d008e2cf2aea5f7e071 They have eventually 3 action heads, as shown in the figure. As per the authors, the movement action for an agent "sets a discretized force along their x and y axis and torque around their z-axis". How do these 3 forces get pulled from the "movement action head", given that they are discretized? Does the movement action head have 3 outputs, each corresponding to an axis? So far, in my humble knowledge, I always assumed that a network with a discrete action space gives only one action per head (which could be the result of a softmax over the action space). Any insight? After some digging, I found that the Movement Action has a "MultiDiscrete" type, which is a gym class. I am struggling to understand the nature of actions produced by this class, any idea? submitted by /u/AhmedNizam_ [link] [comments]

2022-01-02

  • Data Science Central

    Artificial Intelligence for Mental Health
    (4 min) Happy new year! This year , let’s start with an issue that gained so much prominence over the last two years: mental health. A variety of AI techniques and strategies have been employed in mental health. In this post, we summarise the findings based on a paper (link below)   Summary The main predictor…
  • Artificial Intelligence

    I built an AI Discord bot that bans NFT Bros from my server
    (0 min) submitted by /u/TernaryJimbo [link] [comments]
    What can AI ‘discover’ in data that I was not expecting?
    (2 min) Help me understand this please. I’ve read that people feed data to an AI/ML algorithm and find things that they weren’t looking for but there aren’t many good examples out there to read about. So I’m thinking: if I were to feed into an AI/ML algorithm a csv of order details and customer data from an eCommerce system, should I expect the AI engine to “find” something statistically usual/unusual and tell me about it? Or do I need to instruct it to look for certain traits with e-commerce orders and customers that I already know about? Eg look out for fraud by checking x y z. What if the thing to look for is so strange that you need the AI to pick it up. Eg maybe fraudsters start telling their friends to always use a certain name or phone number as an inside joke. I would think an AI might pick on that somehow whereas it might take a human a longer time to figure that out. submitted by /u/rich_atl [link] [comments]
    Software 3.0: Prompt programming
    (0 min) submitted by /u/Respawne [link] [comments]
    Russian Bioinformaticians Have Created A Neural Network Architecture That Can Evaluate How Well An RNA Guide Has Been Chosen For Gene Editing
    (1 min) Genomic editing, particularly the CRISPR/Cas technique, is widely employed in experimental biology, agriculture, and biotechnology. CRISPR/Cas is one of several weapons used by bacteria to resist viruses. As the pathogen’s DNA enters the cell, Cas proteins detect it as foreign hereditary material and break it because its sequences differ from those of the bacteria. To respond to the virus quicker, the bacterium saves pieces of the pathogen’s DNA—much like a computer antivirus retains a collection of viral signatures—and passes them on to subsequent generations so that its Cas can prevent future attacks. Teams from different laboratories independently adapted the CRISPR/Cas system to introduce arbitrary changes into DNA sequences in human and animal cells. It made genomic editing much easier and more efficient. The critical components of the mechanism are guide RNA, which “marks the site,” and the Cas9 protein, which cleaves DNA at that location. The cell subsequently “heals the wound,” but the genetic code has already been altered. Quick Reading: https://www.marktechpost.com/2022/01/02/russian-bioinformaticians-have-created-a-neural-network-architecture-that-can-evaluate-how-well-an-rna-guide-has-been-chosen-for-gene-editing/ Paper: https://academic.oup.com/nar/advance-article/doi/10.1093/nar/gkab1065/6430490 submitted by /u/ai-lover [link] [comments]
    Tang Jie, the Tsinghua University professor leading the Wu Dao project, said in a recent interview that the group built 100 TRILLION parameter model in June, though it has not trained it to “convergence,” the point at which the model stops improving
    (1 min) submitted by /u/No-Transition-6630 [link] [comments]
    The top 10 AI/Computer Vision papers in 2021 with video demos, articles, and code for each!
    (0 min) submitted by /u/OnlyProggingForFun [link] [comments]
    What AIs are there which are able to edit images?
    (1 min) My dad loves technology but does not really understand it, I thought I take some pictures from him and AI should edit them so he looks like a simpsons etc. submitted by /u/xXLisa28Xx [link] [comments]
  • Machine Learning

    [R] A Comparison Of The Program Synthesis Performance Of GitHub Copilot And Genetic Programming
    (0 min) submitted by /u/EducationalCicada [link] [comments]
    [D] Machine Learning - WAYR (What Are You Reading) - Week 128
    (1 min) This is a place to share machine learning research papers, journals, and articles that you're reading this week. If it relates to what you're researching, by all means elaborate and give us your insight, otherwise it could just be an interesting paper you've read. Please try to provide some insight from your understanding and please don't post things which are present in wiki. Preferably you should link the arxiv page (not the PDF, you can easily access the PDF from the summary page but not the other way around) or any other pertinent links. Previous weeks : 1-10 11-20 21-30 31-40 41-50 51-60 61-70 71-80 81-90 91-100 101-110 111-120 121-130 Week 1 Week 11 Week 21 Week 31 Week 41 Week 51 Week 61 Week 71 Week 81 Week 91 Week 101 Week 111 Week 121 Week 2 Week 12 Week 22 Week 32…
    [D] ICLR 2022 Open Discussion Quality
    (1 min) First time submitting to ICLR, I'm wondering how is the open discussion on openreview.net really different from the review-rebuttal procedure used in other conferences. For the papers I'm reviewing, about half the reviewers reacted to the authors' responses (clarifications, modifications, additional experiments, etc.). As to my submission (5568), I run extra experiments and answered each of the concerns directly but received 0 feedback from the reviewers. As a reviewer, I think it doesn't matter if the rebuttal changes your mind about the quality of the submission but it's very basic manner to reply to the authors' responses. A simple "Thank the authors for the responses but I don't think these addressed my concerns" would work. Saying nothing only means you are uncertain if the responses make sense and you just doesn't care to figure it out. The authors spent a whole week running experiments to answer some of your questions and if you don't give **** about their responses, just keep the questions with you and don't submit your review. I was hoping for a different experience submitting to ICLR and then I realized the "discussion" is basically broken. submitted by /u/MLPRulesAll [link] [comments]
    [P] Tensorflow / Keras implementation of Vision Transformer https://arxiv.org/abs/2010.11929v2
    (0 min) An Image is Worth 16x16 Words: ViT Excellent results compared to SOTA CNNs while requiring fewer computational resources to train. Paper : https://arxiv.org/abs/2010.11929v2 Code : https://github.com/avinash31d/paper-implementations submitted by /u/avinash31d [link] [comments]
    [D] Simple Questions Thread
    (3 min) Please post your questions here instead of creating a new thread. Encourage others who create new posts for questions to post here instead! Thread will stay alive until next one so keep posting after the date in the title. Thanks to everyone for answering questions in the previous thread! submitted by /u/AutoModerator [link] [comments]
    [R] Learning 3D Representations from 2D Images
    (2 min) submitted by /u/pinter69 [link] [comments]
    [D] Paper Explained & Author Interview - Player of Games: All the games, one algorithm! (Video Walkthrough)
    (1 min) https://youtu.be/U0mxx7AoNz0 Special Guest: First author Martin Schmid (https://twitter.com/Lifrordi) Games have been used throughout research as testbeds for AI algorithms, such as reinforcement learning agents. However, different types of games usually require different solution approaches, such as AlphaZero for Go or Chess, and Counterfactual Regret Minimization (CFR) for Poker. Player of Games bridges this gap between perfect and imperfect information games and delivers a single algorithm that uses tree search over public information states, and is trained via self-play. The resulting algorithm can play Go, Chess, Poker, Scotland Yard, and many more games, as well as non-game environments. ​ OUTLINE: 0:00 - Introduction 2:50 - What games can Player of Games be trained on? 4:00 - Tree search algorithms (AlphaZero) 8:00 - What is different in imperfect information games? 15:40 - Counterfactual Value- and Policy-Networks 18:50 - The Player of Games search procedure 28:30 - How to train the network? 34:40 - Experimental Results 47:20 - Discussion & Outlook ​ Paper: https://arxiv.org/abs/2112.03178 submitted by /u/ykilcher [link] [comments]
    [Research][Project] The top 10 AI/Computer Vision papers in 2021 with video demos, articles, and code for each!
    (0 min) submitted by /u/OnlyProggingForFun [link] [comments]
    [D] Machine Learning Research
    (0 min) Hi everyone, I've compiled the trusted sources of ideation based on top-tier conferences on Machine Learning and Deep Learning worldwide. This repository includes datasets, tasks, state-of-the-art and more. Repository GitHub https://preview.redd.it/9p1jkei8c9981.png?width=1000&format=png&auto=webp&s=906c20c58b5ee569b25848a7fbf0b41ca4caf354 submitted by /u/tuanlda78202 [link] [comments]
    [P] Quick-Deploy - Optimize, convert and deploy machine learning models
    (0 min) Hello Reddit, releasing one of my OSS projects: Quick-Deploy .. github: https://github.com/rodrigobaron/quick-deploy blog post: https://rodrigobaron.com/posts/quick-deploy ​ It's in the very early stage, feel free to contribute or give a star 🙂 submitted by /u/rodrigobaron [link] [comments]
    [D] Raising errors while using accelerators
    (1 min) Why is it so hard to raise exceptions when ML pipeline is using a GPU? When for example you make a classic "Index out of bounds" error, in libraries like PyTorch you get some generic "CUDA" error and you can't see the exact error until you transfer the tensors explicitly to CPU and rerun the code. Do you think there is a possibility for this to improve in the future? Sorry if this is more CS-related question submitted by /u/Icy_Fisherman7187 [link] [comments]
    [D] Are NN actually overparametrized?
    (3 min) I often read that NN or CNN are overparametrized. But, for example, resnet18 has 11M parameters while cifar10 has 50k32323=153M data points. How is that be an overparametrized network on cifar10? Or even on mnist which has 60k28*28=47M data points submitted by /u/alesaso2000 [link] [comments]
    [D] Coding Practices
    (7 min) My job is to work with ML engineers and provide them with whatever they need to experiment with/train/test/deploy ML models -- GPU infrastructure, distributed training support, etc. When I interface with their code, I almost always find it so poorly written, with little to no thought given to long-term stability or use -- for code that they 100% know is going to production. They're brilliant people, far smarter than me, and really good at what they do, so it's not a matter of them not being good enough. I feel (from my very limited experience, so I'm happy to be wrong) like ML engineers are incentivized to write poor code. The only metric for evaluation seems to be accuracy, loss, and all the plots that come up. In research, I understand completely, that's where the focus lies, but in industry? I've seen many models perform poorly because the code is so hard to read and refactor that big issues remained unspotted for months together. And this is especially befuddling because for a field that is completely fine with spending months to get an ROI of single digit increases in model performance metrics during the experimentation phase, they don't seem to care about anything that might go wrong in production. That just feels like a fundamental disconnect, since without the core ML stuff working perfectly, none of the other stuff (like what I do) has any value -- and even so, I'm taught to hold my code to a much higher standard than the critical stuff -- which I'm happy about since I can now write production code by default -- but it's just... weird. Like the vending machines at a nuclear power plant being better engineered than the reactor. Is this a common problem or is this a localized issue that I'm facing? submitted by /u/vPyDev [link] [comments]
  • Neural Networks, Deep Learning and Machine Learning

    I built an AI Discord Bot that bans NFT Bros [Meme][Video]
    (0 min) submitted by /u/TernaryJimbo [link] [comments]
    [project] Credit Scoring
    (0 min) hey peeps, any research thesis regarding credit scoring in microfinance? submitted by /u/abschlusssss [link] [comments]
  • Reinforcement Learning

    "Player of Games", Schmid et al 2021 {DM} (generalizing AlphaZero to imperfect-information games)
    (1 min) submitted by /u/gwern [link] [comments]
    What is the best method/technology to deploy deep reinforcement learning models for robotics?
    (1 min) Hello, I am a robotics enthusiast building a robot arm to perform mundane tasks. So far, I have only been able to deploy deep reinforcement learning models on the Jetson Nano. The process has been easy using the Nvidia TensorRT SDK. What other methods/technology exist? My friend recommended using a google coral and a raspberry pi. Any recommendations would be greatly appreciated. Thank you submitted by /u/AINerd1 [link] [comments]
    Markov Decision Process explained through Memes
    (1 min) Today I published my first article on medium explaining Markov Decision Process with memes. Hope someone finds this helpful. https://medium.com/@saminbinkarim/explaining-markov-decision-process-with-memes-af679a0af343 submitted by /u/sardines_again [link] [comments]
    Proper way to count environment steps / frames in distributed RL architecture for algorithms like CLEAR or LASER => basically modified impala with replay
    (2 min) In classical - on-policy - vtrace/Impala algorithm env_steps are incremented every training iteration like this : env_steps += size_of_batch * unroll_length However when we add replay_buffer to vtrace (like in CLEAR or LASER) we also start to learn from off-policy data therefore some trajectories (from replay) could be sampled more than once. My question is - does the env_step_counting stay the same even when some trajectories from batch are from replay and some directly from workers (on-policy) ? Or do i count steps only on those trajectories that are on-policy => not from replay like : env_steps += len(batch["on_policy_samples"]) * unroll_length And if so what happens when i decide to use only samples from replay to train - how do i count env_steps then ? Using some kind of flag to indicate if specific trajectory was sampled already and leave it from counting ? ​ Have been digging through the research papers for some time now and i haven`t been able to find what is the proper way to do it. Any help or advice would be much appreciated. submitted by /u/ParradoxSVK [link] [comments]
    What should a beginner do after learning some basic ideas?
    (2 min) Hi all, it's my first time to use reddit to ask for suggestions. I am a beginner in RL. I learned some basic ideas in RL (such as SARSA, PPO and A2C) recently and applied Q-network to train a Go game agent. Personally speaking, I am going to pursue for a position as a RL engineer in the future. But I have no idea about what should I do to improve my RL skills. So I am wondering if there are some awesome resources for a beginner like me? Or could you please give me some instructions? submitted by /u/ZavierTi2021 [link] [comments]

2022-01-01

  • Data Science Central

    Building a hypergraph-based semantic knowledge sharing environment for construction
    (5 min) High-rise construction in the Denny Triangle area of Seattle. Source: Bruce Englehardt, Wikimedia Commons, May 2020 I've been looking for fresh use cases that tell the knowledge graph adoption story from new angles. Which industries, for instance, been having…
  • John D. Cook

    Architecture and Math
    (1 min) I recently received a review copy of Andrew Witt’s new book Formulations: Architecture, Mathematics, and Culture. The Hankel function on the cover is the first clue that this book contains some advanced mathematics. Or rather, it references some advanced mathematics. I’ve only skimmed the book so far, but I didn’t see any equations. Hankel functions […] Architecture and Math first appeared on John D. Cook.
    Turning the Golay problem sideways
    (3 min) I’ve written a couple posts about the Golay problem recently, first here then here. The problem is to find all values of N and n such that is a power of 2, say 2p. Most solutions apparently fall into three categories: n = 0 or n = N, N is odd and n = (N-1)/2, […] Turning the Golay problem sideways first appeared on John D. Cook.
  • Artificial Intelligence

    Hierarchical Federated Learning-Based Anomaly Detection Using Digital Twins For Internet of Medical Things (IoMT)
    (1 min) Smart healthcare services can be provided by using Internet of Things (IoT) technologies that monitor the health conditions of patients and their vital body parameters. The majority of IoT solutions used to enable such services are wearable devices, such as smartwatches, ECG monitors, and blood pressure monitors. The huge amount of data collected from smart medical devices leads to major security and privacy issues in the IoT domain. Considering Remote Patient Monitoring (RPM) applications, we will focus on Anomaly Detection (AD) models, whose purpose is to identify events that differ from the typical user behavior patterns. Generally, while designing centralized AD models, the researchers face security and privacy challenges (e.g., patient data privacy, training data poisoning). To overcome these issues, the researchers of this paper propose an Anomaly Detection (AD) model based on Federated Learning (FL). Federated Learning (FL) allows different devices to collaborate and perform training locally in order to build Anomaly Detection (AD) models without sharing patients’ data. Specifically, the researchers propose a hierarchical Federated Learning (FL) that enables collaboration among different organizations, by building various Anomaly Detection (AD) models for patients with similar health conditions. Continue Reading the Paper Summary: https://www.marktechpost.com/2022/01/01/hierarchical-federated-learning-based-anomaly-detection-using-digital-twins-for-internet-of-medical-things-iomt/ Full Paper: https://arxiv.org/pdf/2111.12241.pdf submitted by /u/ai-lover [link] [comments]
    Generate artistic images with OpenAI’s Glide 🖼
    (0 min) submitted by /u/tridoc [link] [comments]
    My Top 10 Computer Vision papers of 2021
    (0 min) submitted by /u/OnlyProggingForFun [link] [comments]
    Should we be concerned?
    (3 min) Should we be a little worried by how fast AI is developing? submitted by /u/Particular_Leader_16 [link] [comments]
    Is there a community/website/company which is collecting and categorizing AIs?
    (0 min) submitted by /u/xXNOdrugsForMEXx [link] [comments]
  • Machine Learning

    [D] Plug or Integrate a GNN Pytorch code base into Spark Cluster
    (1 min) Does anyone have a better explanation or resources to share for plug or Integrate a Pytorch based GNN models into Pyspark or similar cluster services? submitted by /u/SpiritMaleficent21 [link] [comments]
    [P] DeepCreamPy - Decensoring Hentai with Deep Neural Networks
    (0 min) submitted by /u/binaryfor [link] [comments]
    "[R]" Neuron outputs as weights
    (1 min) https://stats.stackexchange.com/questions/558864/what-if-weights-of-model-is-output-of-neurons submitted by /u/Dry_Introduction_897 [link] [comments]
    [D] Best Practices in Machine Learning
    (1 min) This is a non-profit that promotes best practices in machine learning, specifically for responsible ML. The practices are open source too, which is cool. Link here: https://www.fbpml.org/the-best-practices I think their technical best practices seems a little stronger than the organisational ones. Thoughts? ** this is their LinkedIn URL: https://www.linkedin.com/company/the-foundation-for-best-practices-in-machine-learning submitted by /u/Sbu91 [link] [comments]
    [Research] My Top 10 Computer Vision papers of 2021
    (0 min) submitted by /u/OnlyProggingForFun [link] [comments]
    [R] MT3: Multi-Task Multitrack Music Transcription
    (2 min) submitted by /u/Illustrious_Row_9971 [link] [comments]
    BERT Goes Shopping: Comparing Distributional Models for Product Representations (Paper Walkthrough) [D]
    (0 min) submitted by /u/prakhar21 [link] [comments]
  • Neural Networks, Deep Learning and Machine Learning

    Neural network drawing mushrooms
    (0 min) submitted by /u/noodlefist [link] [comments]
  • Reinforcement Learning

    NetHack 2021 NeurIPS Challenge -- winning agent episode visualizations
    (5 min) Hi All! I am Michał from the AutoAscend team that has won the NetHack 2021 NeurIPS Challenge. I have just shared some episode visualization videos: https://www.youtube.com/playlist?list=PLJ92BrynhLbdQVcz6-bUAeTeUo5i901RQ The winning agent isn't based on reinforcement learning in the end, but the victory of symbolic methods in this competition shows what RL is still missing to some extent -- so I believe this subreddit is a good place to discuss it. We hope that NLE will someday become a new standard benchmark for evaluation next to chess, go, Atari, etc. as it presents a set of whole new complex problems for agents to learn. Contrary to Atari, NetHack levels are procedurally generated, and therefore agents can't memorize the layout. Observations are highly partial, rewards are sparse, and episodes are usually very long. Here are some other useful links related to the competition: Full NeurIPS Session recording: https://www.youtube.com/watch?v=fVkXE330Bh0 AutoAscend team presentation starts here: https://youtu.be/fVkXE330Bh0?t=4437 Competition report: https://nethackchallenge.com/report.html AICrowd Challenge link: https://www.aicrowd.com/challenges/neurips-2021-the-nethack-challenge submitted by /u/procedural_only [link] [comments]
    Some questions on DRL
    (1 min) Hello, I´m applying Deep Reinforcement Learning for the first time, and I have some questions about it (I´ve already looked for an answer but in vain): How to normalize objectives' values in the reward function? if we have an objective that values are in the range of 10 and another objective that values are in the range of 1000. During the training phase, how can we watch the weights updates of a network and the gradient calculation too? In a multi-agent setting and episodic task, for "dones" vector, it will be set to "True" once all the agents are finished, or once an agent finishes the task done[agent_index]=True in other words, we won´t wait the latest agent to finish to set dones = [True]*number_of_agents Thank you. submitted by /u/GuavaAgreeable208 [link] [comments]
    Explained Variance suddenly crashes, and stay around zero and even sometimes go to negative values
    (2 min) Hi, all Happy New Year! I'm working on a project to teach my quad-rotor how to hover, and these are what my rollout graphs looks like. The reward graph constantly increases for some time, but suddenly during a rapid increase in reward, the entropy loss, and the train std of starts to diverge, and the explained variance just crashes. ​ https://preview.redd.it/yv4qbhmbc2981.png?width=826&format=png&auto=webp&s=31322a62963b1de47ed72452dfb36498ce222dd1 ​ https://preview.redd.it/d38papwdc2981.png?width=1373&format=png&auto=webp&s=0ab5b98d0bdad165f0dcfd7f5ea3e54ad965842c From my understanding, this explained variance far from 1 is a bad sign, Could you please point out, or give comments on what would be the possible reason why this is happening in my case? Any kind of reply would be appreciate. Thanks again Happy New Year. :D ​ Hp: ent_coef=0.0001, learning_rate=linear decay from 0.00004, n_steps=1024, batch_size=64 apart from these, I used the default params from stable_baselines3 I scale the action output from [-1, 1] to [min_rpm max_rpm] to actuate the quad-rotor. observations states = input = [pos,vel,acc,ang,ang_vel] // Network size 64,64 // algorithm used:PPO REWARD FUNCTION ​ https://preview.redd.it/o3w2i0r7s9981.png?width=818&format=png&auto=webp&s=f03b4f12d6c57f7b7aaec7dbd03b9fecad2c8f2c submitted by /u/GOMTAE [link] [comments]
    Help With PPO Model Performing Poorly
    (2 min) I am attempting to recreate the PPO algorithm to try to learn the inner workings of the algorithm better and to learn more about actor-critic reinforcement learning. So far, I have a model that seems to learn, just not very well. In the early stages of training, the algorithm seems more sporadic and may happen to find a pretty solid policy, but due to unstable the early parts of training are, it tends to move away from this policy. Eventually, the algorithm moves the policy toward a reward of around 30. For the past few commits in my repo where I have attempted to fix this issue, the policy always tends to the around 30 reward mark, and I'm not entirely sure why it's doing this. I'm thinking maybe I implemented the algorithm incorrectly, but I'm not certain. Can someone please help me with this issue? Below is a link to an image of training using the latest commit, using a previous commit, and my Github project current commit: https://ibb.co/JQgnq1f previous commit: https://ibb.co/rppVHKb GitHub: https://github.com/gmongaras/PPO_CartPole ​ Thanks for your help! submitted by /u/gmongaras [link] [comments]

2021-12-31

  • Data Science Central

    Lying to blockchains and other Web3 dilemmas
    (6 min) Semantic data consultant @EmekaOkoye made a couple of great points on Twitter back in November and early December 2021: “I am amused how some of us are pretending that #SmartAgent [tech] is not going to be a thing.” …
  • The Official NVIDIA Blog

    5 Ways AI Aimed to Improve the World in 2021
    (3 min) Not so long ago, searching for information could lead to a library to scan endless volumes or even tediously sift through microfilm. Clearly, technology is making the world a better place. Scientists, researchers, developers and companies have been on a quest to solve some of the world’s most pressing problems. Only now they’re accelerating their Read article > The post 5 Ways AI Aimed to Improve the World in 2021 appeared first on The Official NVIDIA Blog.
  • Blog

    Anomaly Detection with Isolation Forest and Kernel Density Estimation
    (7 min) Anomaly detection is to find data points that deviate from the norm. In other words, those are the points that […] The post Anomaly Detection with Isolation Forest and Kernel Density Estimation appeared first on Machine Learning Mastery.
  • Seita's Place

    Books Read in 2021
    (28 min) At the end of every year I have a tradition where I write summaries of the books that I read throughout the year. Here’s the following post with the rough set of categories: Popular Science (6 books) History, Government, Politics, Economics (6 books) Biographies / Memoirs (5 books) China (5 books) COVID-19 (2 books) Miscellaneous (7 books) I read 31 books this year. You can find the other blog posts from prior years (going back to 2016) in the blog archives. Popular Science This also includes popular science, which means the authors might not be technically trained as scientists. Who We Are and How We Got Here: Ancient DNA and the New Science of the Human Past (2018) is by famous geneticist and Harvard professor David Reich. Scientific advances in analyzing DNA have allowed better analysis…
  • John D. Cook

    Update on the Golay problem
    (2 min) I played around with what I’m calling the Golay problem over Christmas and wrote a blog post about it. I rewrote the post as I learned more about the problem due to experimentation and helpful feedback via comments and Twitter. In short, the Golay problem is to classify the values of N and n such […] Update on the Golay problem first appeared on John D. Cook.
  • Artificial Intelligence

    AI - A love story // AI-generated video about the future of AI // prompt -> GPT-J-6B -> Aphantasia
    (1 min) submitted by /u/6owline1vex [link] [comments]
    AI News in 2021: a Detailed Digest
    (0 min) https://lastweekin.ai/p/ai-news-in-2021-a-digest submitted by /u/regalalgorithm [link] [comments]
    I made a virtual Twitch Streamer, who responds to your chats using OpenAI and Text-to-Speech
    (1 min) submitted by /u/C0de_monkey [link] [comments]
    Happy New Year to everyone! Let's hope for a wonderful 2022. The text is created with polygons that evolve with an Evolutionary Algorithm.
    (1 min) submitted by /u/ValianTek_World [link] [comments]
    [R] Microsoft’s Self-Supervised Bug Detection and Repair Approach Betters Baselines By Up to 30%
    (1 min) In the NeurIPS 2021-accepted paper Self-Supervised Bug Detection and Repair, a Microsoft Research team proposes BUGLAB, a self-supervised approach that significantly improves on baseline methods for detecting bugs in real-life code. Here is a quick read: Microsoft’s Self-Supervised Bug Detection and Repair Approach Betters Baselines By Up to 30%. The code and PyPIBugs dataset are available on the project’s GitHub. The paper Self-Supervised Bug Detection and Repair is on arXiv. submitted by /u/Yuqing7 [link] [comments]
  • Machine Learning

    [P] Play around with StyleGAN2 in your browser
    (1 min) I built a little page to run and manipulate StyleGAN2 in the browser. https://ziyadedher.com/faces It was pretty fun learning about ONNX and how to port GANs to web. You can play around with the random seeds and also distort the intermediate latents to produce some really wacky results. You can check out a GIF on Twitter. Let me know if you come up with anything cool! submitted by /u/Cold999 [link] [comments]
  • Reinforcement Learning

    Agent not learning! Any Help
    (2 min) Hello Can someone explain why the actor critic maps the states to the same actions, in other words why the actor outputs the same action whatever the states? This what makes the agent learns nothing during training phase. Happy New Year! submitted by /u/LeatherCredit7148 [link] [comments]
    question about baseline with value network
    (1 min) I have a question about baseline in policy gradient method. Based on what is implemented in relation to PG, there are many shared policy and value networks. But I understood baseline should not affect the parameters of the policy. When PG with shared networks and baseline updates value function, then baseline will affect policy network. Is it okay? submitted by /u/Spiritual_Fig3632 [link] [comments]
2022-01-30T00:36:36.366Z osmosfeed 1.12.0